... Agent Goals, RAG, Agent Architecture & Models
By Lavanya Subbarayalu, Principal Solution Architect, PwC
Designing Enterprise Agentic AI Systems
Enterprise adoption of Agentic AI requires more than connecting large language models to tools. Production systems should coordinate reasoning, retrieval, multi-agent collaboration, tool execution, and runtime orchestration while maintaining reliability, security, and operational visibility.
Many early implementations fail because core architectural decisions are implicit rather than designed deliberately. Without clear boundaries between reasoning, coordination, execution, and governance layers, multi-agent systems quickly become fragile, difficult to evaluate, and expensive to operate.
This document provides practical guidance for architects and AI engineers designing enterprise-grade Agentic AI systems.
The framework organizes Agentic AI design into eight architecture areas.
Part 1 focuses on the foundational design areas, which define how agents reason, retrieve information, and interact with enterprise systems:
- Agent goals, skills, and context management
- Retrieval Augmented Generation (RAG) design
- Agent architecture patterns and frameworks
- Models, tools, and integration strategies
Part 2 covers the operational layers required to run agent systems in production, including communication protocols, evaluation, infrastructure, reliability, and security.
Enterprise Agentic AI Reference Architecture
The following diagram illustrates the Agentic AI reference architecture used throughout this design guidance.

Figure 1: Enterprise Agentic AI reference architecture
The diagram illustrates the runtime execution flow and supporting operational components used in Enterprise Agentic AI systems.
- Numbered boxes represent the primary runtime execution path.
- Solid arrows represent runtime execution.
- Dashed arrows represent asynchronous data flows and feedback loops.
- Left panels represent data pipelines.
- Right panels represent evaluation and governance capabilities.
0. Should You Build an Agentic System?
Before designing an agent architecture, the suitability of Agentic AI for the problem should be validated. The use case can be evaluated across the following dimensions.
| Dimension | Traditional AI Signal | Agentic AI Signal |
| Task Complexity | Single step: classify, predict, extract | Multi-step: plan → execute → verify → adapt |
| Decision Autonomy | Human directs every action | System makes intermediate decisions within guardrails |
| Tool Usage | 0–2 fixed integrations, hard-coded | 3+ tools, dynamically selected based on context |
| Workflow Shape | Linear, deterministic pipeline | Branching, loops, context-dependent paths |
| Error Handling | Fail fast, human retries | Self-corrects, tries alternatives, escalates intelligently |
Decision Guidance (Illustrative):
Score each dimension 1 (traditional) to 5 (agentic). Total >20 → agentic justified. Total 12–20 → hybrid (agentic for specific sub-flows only). Total <12 → traditional AI.
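The illustrative scoring rule above can be sketched as a small function. The dimension names in the example dictionary are shorthand for the five table dimensions; the thresholds are the indicative ones stated above, not fixed rules.

```python
# Illustrative classifier for the five suitability dimensions.
# Each dimension is scored 1 (traditional) to 5 (agentic).
def classify_use_case(scores: dict[str, int]) -> str:
    """Apply the illustrative thresholds: >20 agentic, 12-20 hybrid, <12 traditional."""
    total = sum(scores.values())
    if total > 20:
        return "agentic"
    if total >= 12:
        return "hybrid"
    return "traditional"

example = {
    "task_complexity": 5,
    "decision_autonomy": 4,
    "tool_usage": 5,
    "workflow_shape": 4,
    "error_handling": 4,
}
# total = 22 -> "agentic"
```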
⚠ Anti-pattern
Building agentic systems for simple classification or single-API tasks. Linear and deterministic workflows are better served by traditional pipelines.
0.1 Design Principles for Enterprise Agentic AI
Enterprise agent systems introduce new architectural complexity. The following principles help keep systems reliable, scalable, and maintainable.
Separate reasoning from execution
Planning, coordination, and execution should be implemented as distinct layers.
Separate orchestration from agent tasks
Orchestration manages workflow state while agents focus on reasoning and task execution.
Treat skills as contracts
Agent skills should expose structured inputs, outputs, and error handling.
Constrain agent autonomy
Agents should operate within defined tool permissions, task scope, and safety policies.
Design feedback loops early
Evaluation, human review, and telemetry should continuously improve the system.
Optimize model usage
Model routing or cascading can be used to balance capability, latency, and cost.
1. Agent Goals, Skills & Context
This area defines what the agent is trying to accomplish (goals), what discrete capabilities it has (skills), and how it manages information needed to act (context).
1.1 Goal Design
Every agent should have exactly one measurable primary goal. Multiple competing objectives lead to unpredictable prioritization. The goal should be decomposable into a sequence of skills that can be individually validated.
| Goal Property | Requirement | Example |
| Measurability | Quantifiable success metric | Resolution rate >95% within 2 turns |
| Decomposability | Expressible as ordered skill sequence | Intake → Verify → Route → Resolve |
| Scope | Bounded — agent can fail gracefully | Single customer issue per session |
| Observability | Every step is loggable and traceable | Each skill invocation has correlation ID |
1.2 Skill Design & Contracts
Skills are atomic capabilities. Each skill has a defined contract: what it accepts, what it returns, and how it fails. This contract enables independent testing and composition.
The Skill Contract Schema:
| Contract Element | Description | Example |
| Name | Unique identifier, verb-noun format | verify_identity |
| Input Schema | Typed input specification (JSON Schema) | {"customer_id": "string", "method": "enum[email,sms,push]"} |
| Output Schema | Typed success response | {"verified": "bool", "confidence": "float", "evidence": "string"} |
| Error Codes | Enumerated failure modes | NOT_FOUND, TIMEOUT, LOW_CONFIDENCE, RATE_LIMITED |
| SLA | Timeout and reliability targets | Timeout: 5s, Success rate: 99.5% |
| Fallback | What happens on failure | Try secondary provider → escalate if both fail |
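A skill contract can be represented directly in code. The sketch below is a minimal dataclass whose fields mirror the contract elements in the table; the `verify_identity` values are the table's examples, not a real implementation.

```python
# Minimal skill-contract representation; field names follow the table above.
from dataclasses import dataclass

@dataclass
class SkillContract:
    name: str                    # unique identifier, verb-noun format
    input_schema: dict           # JSON Schema for inputs
    output_schema: dict          # JSON Schema for the success response
    error_codes: list[str]       # enumerated failure modes
    timeout_s: float             # SLA timeout
    success_rate_target: float   # SLA reliability target
    fallback: str                # behavior on failure

verify_identity = SkillContract(
    name="verify_identity",
    input_schema={"customer_id": "string", "method": "enum[email,sms,push]"},
    output_schema={"verified": "bool", "confidence": "float", "evidence": "string"},
    error_codes=["NOT_FOUND", "TIMEOUT", "LOW_CONFIDENCE", "RATE_LIMITED"],
    timeout_s=5.0,
    success_rate_target=0.995,
    fallback="secondary_provider_then_escalate",
)
```

Because the contract is a typed object, it can be registered, validated, and tested independently of the agent that invokes it.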
How Many Skills Per Agent?
| Skill Count | Action | Rationale |
| 1–3 | Single agent, flat structure | Low complexity, minimal orchestration |
| 3–5 | Single agent, optimal range | Sufficient capability without coordination overhead |
| 6–8 | Consider splitting into sub-agents | Orchestration complexity increases significantly |
| >8 | Decompose — hierarchical routing required | Single agent cannot reliably manage >8 skills |
⚠ Anti-pattern
Assigning 10+ skills to one agent. Orchestration complexity grows with skill count. At >8, route to specialized sub-agents instead.
1.3 Skill Routing & Invocation
When an agent receives a user intent, it should decide which skill to invoke. This routing decision is critical — incorrect routing wastes a full skill execution and may produce incorrect results.
| Routing Method | How It Works | When to Use | Accuracy |
| Semantic Similarity | Embed intent, cosine-match to skill descriptions | General-purpose, <20 skills | 85–92% |
| Intent Classifier | Dedicated fine-tuned classifier model | >20 skills, high-volume | 92–97% |
| Keyword Rules | Regex/keyword matching to skill triggers | Simple, well-defined domains | 95%+ if domain is narrow |
| LLM Router | Frontier model selects skill from description set | Complex ambiguous intents | 88–94%, higher latency |
Fallback Trigger Logic:
Route fails → retry with next-best match (if confidence gap <0.1) → if still failing, escalate to human. Circuit breaker: 3 consecutive failures on same skill → mark as degraded, bypass in routing.
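The fallback and circuit-breaker logic above can be sketched as follows. This is an illustrative in-memory version; the class and method names are assumptions, and a production router would persist failure counts and emit telemetry.

```python
# Sketch of routing fallback with a per-skill circuit breaker, per the
# trigger logic above: retry next-best when the confidence gap is <0.1,
# and bypass a skill after 3 consecutive failures.
class SkillRouter:
    FAILURE_THRESHOLD = 3  # consecutive failures before a skill is degraded
    CONFIDENCE_GAP = 0.1   # retry next-best only when candidates are this close

    def __init__(self):
        self.failures: dict[str, int] = {}

    def is_degraded(self, skill: str) -> bool:
        return self.failures.get(skill, 0) >= self.FAILURE_THRESHOLD

    def record(self, skill: str, success: bool) -> None:
        self.failures[skill] = 0 if success else self.failures.get(skill, 0) + 1

    def pick(self, ranked: list[tuple[str, float]]) -> list[str]:
        """Return the primary skill, plus a retry candidate when the
        confidence gap to the next-best match is below the threshold.
        An empty result means: escalate to a human."""
        candidates = [(s, c) for s, c in ranked if not self.is_degraded(s)]
        if not candidates:
            return []
        picks = [candidates[0][0]]
        if (len(candidates) > 1
                and candidates[0][1] - candidates[1][1] < self.CONFIDENCE_GAP):
            picks.append(candidates[1][0])
        return picks
```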
Routing Accuracy Measurement
Routing accuracy = (correct skill selections / total routing decisions) × 100
This metric should be evaluated against a labeled intent dataset and reviewed when skills or routing logic change.
⚠ Anti-pattern
Using an expensive frontier model for every routing decision. For <20 skills with clear descriptions, semantic similarity or a small, fine-tuned classifier is 5–10x cheaper and often more accurate.
Note: Frontier models are the most advanced and capable LLMs (e.g., GPT-class, Claude-class, or other flagship models). They offer the best reasoning but are the most expensive.
1.4 Context & Prompt Management
Context is the information available to an agent when making decisions or executing skills. Managing context is a resource allocation problem: insufficient context prevents action; excessive context degrades quality, increases cost, and risks hitting token limits.
Context Budget Calculation
Step 1 — Determine Model Window: The model’s maximum context window should be obtained from provider documentation (e.g., 32K, 128K, 200K tokens).
Step 2 — Calculate Token Budget: As a rule of thumb, allocate at most 20% of the window to context. Example: 128K window → 25.6K token budget for context.
Step 3 — Allocate Context Budget: The context budget should be divided among conversation turns, retrieved chunks, and skill instructions. Typically, the remaining 80% is reserved for the model’s reasoning and output generation.
The allocations given below are illustrative starting points and should be adjusted based on use case, retrieval needs, and model behavior.
| Model Window | Illustrative Context Budget | Example Allocation (Indicative) |
| 32K tokens | 6.4K tokens | ~3 conversation turns + 1 retrieved chunk |
| 128K tokens | 25.6K tokens | ~5 turns + 3–4 retrieved chunks + skill instructions |
| 200K tokens | 40K tokens | ~8 turns + 5–6 chunks + full skill suite + examples |
Note: The 20% guideline reserves 80% for the model’s reasoning, skill instructions, and output generation. Exceeding this threshold consistently can lead to quality degradation — models may truncate or ignore later context.
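The three-step calculation above reduces to simple arithmetic. The sketch below applies the illustrative 20% guideline; the fraction is a starting point to be tuned, as the note says.

```python
# Illustrative context-budget arithmetic following the 20% guideline above.
def context_budget(window_tokens: int, context_fraction: float = 0.20) -> dict:
    budget = int(window_tokens * context_fraction)
    return {
        "context_budget": budget,
        "reserved_for_reasoning_and_output": window_tokens - budget,
    }

# 128K window -> 25.6K context budget, per the example in Step 2.
print(context_budget(128_000))
# {'context_budget': 25600, 'reserved_for_reasoning_and_output': 102400}
```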
Hot vs Cold Context
| Context Type | Where Context is Stored | What It Contains | Retrieval |
| System Prompt | In-context (always) | Agent identity, skill list, behavioral rules | Not retrieved (pre-injected) |
| Hot Context | In-context (dynamic) | Last 3–5 turns, current task state | Maintained per session |
| Cold Context | Redis (TTL 24–72hr) | Previous task outcomes, user preferences | Retrieved on demand via similarity search |
| Knowledge Base | Vector DB | Domain documentation, policies | Retrieved per skill invocation |
Prompt Engineering Techniques
While many prompt techniques exist, the following represent commonly used approaches in enterprise agent systems.
| Technique | When to Use | Trade-off |
| Chain-of-Thought | Multi-step reasoning, math, logic | +25% accuracy vs costs 2× tokens |
| Few-shot (3–5 examples) | Consistent output formatting | +20% format compliance vs adds ~2K tokens cost |
| Role Prompting | Define agent scope, authority, and boundaries | Reduces drift vs minimal overhead |
| Goal-Oriented Prompting | Clear objectives with measurable outcomes | Improves consistency vs may reduce flexibility |
| Task Planning Prompts | Multi-step workflows requiring structured execution | Better control & structure vs adaptability (may be rigid) |
| Self-Correction | High-stakes decisions (financial, medical) | +10% reliability vs adds 50% latency per step |
Decision Guidance:
Choose techniques based on the task’s accuracy requirements and acceptable cost increase. Chain-of-Thought is valuable for complex reasoning; few-shot ensures format consistency; self-correction adds reliability for critical decisions.
Structured Output Enforcement: Skill outputs should be validated against the defined JSON Schema. Use the model’s built-in structured output mode (e.g., response_format in OpenAI, or tool_use forcing in Anthropic) rather than post-hoc regex parsing. Post-hoc parsing fails on edge cases.
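As a fallback where a provider's structured output mode is unavailable, a minimal validation layer can be sketched with the standard library alone (a production system might use the `jsonschema` package instead). The field names reuse the `verify_identity` output example from the skill contract table; the function name is an assumption.

```python
# Minimal sketch: parse and type-check a skill's JSON output before it is
# passed downstream; raise rather than forward malformed output.
import json

REQUIRED_FIELDS = {"verified": bool, "confidence": float, "evidence": str}

def validate_skill_output(raw: str) -> dict:
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"bad type for {field}")
    return data

ok = validate_skill_output('{"verified": true, "confidence": 0.92, "evidence": "sms"}')
```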
⚠ Anti-pattern
Free-form text outputs from skills that downstream skills should parse. Every skill output should be schema-validated. Unstructured outputs are a common cause of cascading failures in agent pipelines.
2. RAG, Memory & Vector Search
Retrieval-Augmented Generation (RAG) gives agents access to knowledge beyond their training data. Memory provides continuity across interactions. Together, these form the knowledge layer. This area covers pattern selection, component configuration, and production considerations.
2.1 RAG Design Decision Framework
The need for RAG should be evaluated based on the following criteria.
The thresholds and ranges presented below are example starting points and should be calibrated based on domain requirements, system constraints, and operational data.
| Evaluation Criteria | No RAG Needed | RAG Required |
| Use Case Complexity | Single-turn Q&A, knowledge fits in prompt (<5K tokens) | Multi-document synthesis, knowledge exceeds prompt capacity |
| Data Freshness | Static knowledge, updates <monthly | Frequently updated content (daily/weekly) |
| Data Volume | Small corpus (<50 documents) | Large corpus (>100 documents) |
| Query Type | Broad conceptual questions | Specific fact retrieval, citation needed |
| Precision Requirements | Best-effort answers acceptable | Should be grounded in specific sources |
Decision Guidance (Illustrative):
When multiple criteria indicate that retrieval is required, RAG-based approaches are typically considered. When criteria consistently indicate that retrieval is not required, alternatives such as few-shot prompting or fine-tuning may be sufficient.
These conditions should be validated against domain requirements and evaluation results rather than treated as fixed rules. The thresholds in the table above are indicative starting points and should be calibrated based on domain complexity and risk tolerance.
Note: When the domain is narrow and well-defined with stable structured data, consider direct database queries or API calls instead of RAG. RAG is optimized for unstructured text retrieval, not structured data lookups.
2.2 RAG Pattern Selection
There are 25+ RAG patterns documented in research. These include patterns such as Standard RAG, Hybrid RAG, Self-RAG, Corrective RAG, Multi-hop RAG, Fusion RAG, and Agentic RAG. The commonly used production patterns fall into these categories based on complexity and quality requirements:
| Pattern Category | Representative Patterns | When to Use |
| Basic Retrieval | Standard RAG, Sparse RAG, Constrained RAG | Fast fact lookup, protocol retrieval, <500ms latency |
| Advanced / Hybrid Retrieval / Query Optimization | Hybrid RAG (semantic + keyword), Fusion RAG, Adaptive RAG | Mixed technical and conceptual queries, variable query complexity, common in production |
| Context & Conversation | Conversation RAG, Memory-Augmented RAG, Contextual RAG | Multi-turn interactions, follow-up questions, session-aware retrieval |
| Quality & Reliability | Corrective RAG, Self-RAG, Citation-Aware RAG | High-stakes domains, auditability, verification of retrieved evidence is required |
| Multi-Document Reasoning | Iterative RAG, Multi-hop RAG, Hierarchical RAG, Chain-of-Retrieval RAG | Synthesis across multiple sources, comparative analysis, evidence chaining |
| Scale & Performance | Federated RAG, Long-Context RAG | Multi-system retrieval, large corpora, distributed knowledge environments |
| Agentic Retrieval | Agentic RAG, Speculative RAG, RL-RAG, ReFeed RAG | Complex workflows where retrieval should be planned, refined, or optimized over time |
| Specialized Domain Retrieval | Multimodal RAG, Reasoning RAG, Few-shot RAG, Prompt-Augmented RAG | Mixed-modality inputs, structured output generation, domain-specific reasoning |
The patterns listed above represent the most commonly used production patterns. More advanced patterns should be adopted when complexity and quality requirements justify them.
RAG Pattern Selection Criteria
| Factor | Consideration |
| Query complexity | Single-step lookup vs. multi-step evidence synthesis |
| Source requirements | Single trusted source vs. multiple sources requiring comparison or validation |
| Freshness requirements | Static content vs. frequently updated content |
| Latency tolerance | Real-time retrieval vs. multi-stage retrieval and reranking |
| Reliability requirements | Best-effort responses vs. grounded, auditable, citation-backed outputs |
| Modality | Text-only vs. multimodal inputs such as images, forms, or scanned documents |
Production Guidance (Illustrative):
Hybrid RAG (semantic + keyword retrieval) is commonly used as a production baseline. A typical starting point is a weighted combination (for example, 0.7 semantic / 0.3 keyword), which should be calibrated based on query distribution and domain characteristics.
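The weighted combination above can be sketched as a score-fusion step. This assumes both retrievers return scores normalized to [0, 1]; the 0.7/0.3 weights are the illustrative starting point from the text, and `fuse` is a hypothetical helper name.

```python
# Sketch of weighted hybrid-score fusion (semantic + keyword).
def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.7, w_keyword: float = 0.3) -> float:
    return w_semantic * semantic + w_keyword * keyword

def fuse(semantic_hits: dict[str, float], keyword_hits: dict[str, float]) -> list[str]:
    """Rank document IDs by combined score; missing scores default to 0."""
    ids = set(semantic_hits) | set(keyword_hits)
    return sorted(
        ids,
        key=lambda d: hybrid_score(semantic_hits.get(d, 0.0), keyword_hits.get(d, 0.0)),
        reverse=True,
    )
```

In practice the weights should be calibrated against a labeled query set, since the optimal balance shifts with query distribution (keyword-heavy technical queries vs conceptual ones).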
Precision and Recall in Retrieval
Precision: Of the documents retrieved, how many are relevant? High precision minimizes noise.
Recall: Of all relevant documents in the corpus, how many were retrieved? High recall ensures completeness.
Trade-off: Retrieval systems are typically optimized for recall to ensure relevant candidates are surfaced, while reranking improves precision by ordering results based on true relevance.
Production systems should explicitly balance recall and precision based on use case requirements. High-recall systems require reranking to maintain answer quality.
Retrieval Quality Measurement
Precision@k = relevant retrieved documents / k
Recall@k = relevant retrieved documents / total relevant documents
These metrics should be benchmarked on domain-specific queries prior to production deployment.
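The two formulas translate directly into code. In this sketch, `relevant` is the labeled set of relevant document IDs for a query and `retrieved` is the ranked retrieval output.

```python
# Direct translations of the Precision@k and Recall@k formulas above.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```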
2.3 Embedding Model Selection
The embedding model directly affects retrieval quality. This decision is frequently underestimated — using an inappropriate embedding model is a common cause of poor RAG performance.
The guidance below should be treated as indicative starting points rather than fixed rules. Embedding model selection should be benchmarked against domain queries, retrieval latency targets, storage constraints, and update frequency.
| Selection Criteria | What to Evaluate | Guidance |
| Dimension Size | Vector dimensionality (384–1536+) | Higher dimensions = better quality, more storage & slower search. 768–1024 is a common practical balance. |
| Domain Fit | Performance on domain-specific queries | Embedding models should be benchmarked on at least 100 domain queries before selection. General models degrade into specialized domains. |
| Speed vs Quality | Encoding latency per batch | Example: For <100ms retrieval SLO, use lighter models (384-dim). For quality-critical paths, accept 200–400ms with 1024-dim. |
| Update Cost | Re-embedding cost when docs change | If data updates frequently, choose models with incremental update support. |
2.4 Chunking Strategy
Chunking strategy directly affects retrieval accuracy. When the size is too small, crucial information might be overlooked; conversely, if it’s too large, excessive noise can interfere with retrieval.
Additional chunking mechanisms may also be used depending on document structure and retrieval objectives:
- Semantic chunking: split on meaning or topic boundaries rather than fixed size
- Sliding-window chunking: overlapping windows for continuity across adjacent spans
- Hierarchical chunking: summary-level parent chunks linked to detailed child chunks
- Structure-aware chunking: split on headings, tables, sections, or form boundaries
The chunk sizes in the table are illustrative starting points and should be calibrated using retrieval quality metrics (such as precision@k, recall@k, and answer relevance), document structure, and model context limits.
| Content Type | Chunk Size (tokens) | Overlap | Strategy |
| Structured data (logs, records) | 100–200 | 0% | Each record is atomic and should not be split across chunks |
| Technical documentation | 300–500 | 15–20% | Split at section boundaries; overlap preserves cross-section references |
| Narrative or conversational content | 500–800 | 20% | Preserve topic continuity across paragraphs |
| Legal / policy documents | 400–600 | 25% | Higher overlap because clauses reference each other |
Note: Overlap is not free — it increases index size and can cause duplicate retrievals. Retrieved chunks should be deduplicated by source ID before being passed to the model.
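Fixed-size chunking with overlap, and the source-ID deduplication mentioned in the note, can be sketched as below. Tokens are approximated by pre-split items here; a real pipeline would use the embedding model's tokenizer, and the function names are assumptions.

```python
# Sketch: sliding-window chunking with overlap, plus source-ID deduplication.
def chunk(tokens: list[str], size: int, overlap_pct: float) -> list[list[str]]:
    """Produce chunks of `size` tokens; consecutive chunks share overlap_pct."""
    step = max(1, int(size * (1 - overlap_pct)))
    return [tokens[i:i + size] for i in range(0, len(tokens), step) if tokens[i:i + size]]

def dedupe_by_source(chunks: list[dict]) -> list[dict]:
    """Keep only the first retrieved chunk per source_id."""
    seen, out = set(), []
    for c in chunks:
        if c["source_id"] not in seen:
            seen.add(c["source_id"])
            out.append(c)
    return out
```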
2.5 Retrieval & Reranking
The RAG pattern determines the retrieval method. For instance, Hybrid RAG requires both semantic (vector) and keyword (sparse) retrieval working together. The pattern dictates the retrieval architecture.
Initial retrieval (via vector similarity) is a recall operation — it surfaces candidates. Reranking is a precision operation — it reorders candidates by relevance. The two-stage approach (retrieve-then-rerank) is the production standard.
| Stage | Method | Speed | Precision | When to Use |
| Retrieval | Vector similarity | Fast (<50ms) | Moderate | Used as stage 1 |
| Reranking | Cross-encoder (pairwise scoring) | Slower (+100–300ms) | High | Top-K ≥ 5 candidates, quality-critical paths |
| Skip Reranking | Use retrieval output directly | Fastest | Moderate | Latency-critical paths (<100ms SLO), low-stakes |
2.6 Index Maintenance & Drift
A vector index is not a static artifact — it degrades as source documents change. Index drift (retrieving stale or outdated content) degrades performance gradually and can be difficult to detect.
| Source Update Frequency | Recommended Strategy | Implementation |
| Daily or less | Batch re-index nightly | Full re-embed during low-traffic window; swap atomically |
| Hourly | Incremental upsert | Track document versions; re-embed only changed docs; append to index |
| Real-time (<1 min) | Streaming pipeline | Kafka → embed worker → upsert to vector DB. Accept eventual consistency. |
| Rare (monthly+) | Manual trigger | Re-index on content release; validate with benchmark queries |
2.7 Memory Architecture
Memory gives agents continuity. Without it, every interaction starts from scratch. Memory should be architected as a system — the wrong memory architecture leads to stale context, privacy violations, or unbounded storage growth.
| Memory Type | Storage | Scope | TTL / Lifecycle | Use Case |
| Session (Short-term) | In-context window | Current conversation | Session end | Conversation flow, turn tracking |
| Working (Workspace) | Redis | Active task | 24–72 hours | Multi-step task state, intermediate results |
| Episodic (Long-term) | Postgres (encrypted) | Per-user persistent | Regulatory policy (e.g., 7 years) | User history, past decisions, preferences |
| Semantic | Vector DB | Cross-user knowledge | 2 years, periodic cleanup | Pattern recognition, experience sharing |
Decision Guidance:
- Working memory and episodic memory should be stored separately to prevent contamination of task state and historical records.
- Episodic memory should be written only after validation to avoid propagating incorrect information.
- Retrieved memory should be constrained to a small portion of the context window (for example <5%) to prevent context overflow.
⚠ Anti-pattern
Storing raw conversation transcripts as long-term memory. Transcripts grow unboundedly and contain noise. Extract and store structured summaries instead.
3. Agent Architecture, Patterns & Frameworks
This area covers structural decisions: how the agent system is organized as software, what reasoning patterns it uses, and whether to use an agent framework or build directly against model APIs. These decisions determine operational complexity, failure behavior, and long-term maintainability.
3.1 Event-Driven Microservices Architecture
Event-driven microservices are the de facto standard for production multi-agent systems at scale. This architecture provides fault isolation, independent scaling, and asynchronous workflows. The conceptual reference architecture consists of these layers:
Layer 1 — API Gateway: Authentication, rate limiting, request routing.
Layer 2 — Agent Services: Each agent type runs as an independent service (stateless or stateful with external state store)
Layer 3 — Event Bus (Kafka/SQS): Asynchronous communication between agents and services
Layer 4 — LLM Gateway: Model routing, prompt caching, token metering
Layer 5 — Data Layer: Vector DBs, Redis (working memory), Postgres (episodic memory), tool registries.
These layers represent a deployment view of the architecture. At runtime, the system follows the flow illustrated in the reference architecture diagram (planner → orchestration → agents → coordination → execution).
Key Design Principle: Agents publish events when tasks are complete; other agents subscribe and react. This decouples services and enables independent deployment.
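The publish/subscribe principle can be illustrated with a minimal in-process sketch. A production system would use Kafka or SQS as the event bus; the topic name and handler here are hypothetical.

```python
# Minimal in-process event bus illustrating publish/subscribe decoupling:
# the publisher does not know which agents react to the event.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
results = []
bus.subscribe("task.completed", lambda e: results.append(e["task_id"]))
bus.publish("task.completed", {"task_id": "t-42"})
# results == ["t-42"]
```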
3.2 Agent State Management
Stateless agents are simpler to operate (horizontal scaling, no sticky sessions, simple health checks). Use stateful approaches only when the use case requires conversation continuity or multi-step task tracking.
| Approach | When to Use | State Store | Complexity |
| Stateless | Single-turn Q&A, classification, extraction | None | Low — fully horizontally scalable |
| Stateful (external) | Conversational agents, multi-step workflows | Redis (session) + Postgres (persistent) | Medium — state store becomes a dependency |
| Stateful (in-context) | Short sessions (<10 turns), low volume | Context window only | Low, but limited by context window size |
⚠ Anti-pattern
Storing state in-process (agent instance memory). When the instance restarts or is load-balanced to a different pod, state is lost. Agent state should be externalized in production systems.
3.3 Handling Partial Failures
In multi-skill or multi-service agent workflows, partial failures are inevitable. If Skill A succeeds (for example, a write to a database) and Skill B fails (for example, a payment times out), the system is in an inconsistent state.
Saga patterns are commonly used in distributed agent workflows to manage long-running tasks and compensation logic when partial failures occur. Other recovery patterns, such as checkpoint-and-resume and idempotent retry, may be more appropriate depending on workflow duration, side effects, and rollback requirements.
State recovery mechanisms (such as checkpointing, retries, and compensation logic) should be selected based on workflow duration and failure impact.
| Pattern | How It Works | When to Use |
| Saga (Choreography) | Each skill publishes success/failure events; other skills listen and compensate | Loosely coupled skills, async workflows |
| Saga (Orchestration) | A central orchestrator coordinates and triggers compensation on failure | Tightly coupled workflows, need clear ownership |
| Retry with Idempotency | Retry failed operations; each operation is idempotent (safe to repeat) | Simple failures (timeouts, transient errors) |
| Checkpoint & Resume | Save workflow state at each step; resume from last checkpoint on failure | Long-running workflows (minutes+) |
Compensation Design Guidance: Every skill that performs a side effect (write, send, charge) should have a corresponding compensation action (rollback, cancel, refund). Define these at contract time — not at incident time.
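The orchestrated saga variant can be sketched as follows: on failure, previously completed steps are compensated in reverse order. The step tuples and names are hypothetical, and a real orchestrator would also persist checkpoints and make compensations idempotent.

```python
# Sketch of orchestrated saga execution with reverse-order compensation.
def run_saga(steps: list[tuple[str, callable, callable]]) -> tuple[bool, list[str]]:
    """Each step is (name, action, compensation). Returns (success, log)."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # Roll back side effects of already-completed steps, newest first.
            for done_name, comp in reversed(completed):
                comp()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log
```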
3.4 Planning & Reasoning Patterns
Agent Pattern Overview
Agent patterns operate across multiple layers of system behavior:
- Interaction patterns define how agents communicate with users, tools, and other agents
- Planning patterns define how tasks are decomposed before execution
- Reasoning patterns define how decisions are made
- Execution and coordination patterns define how tasks are carried out across workflows
Part 1 focuses on planning, reasoning, and interaction patterns. Execution and coordination patterns are covered in Part 2.
Planning Patterns
| Pattern | Best For | Tradeoff |
| Plan-and-Execute | Predictable workflows with defined steps | Fast and structured, but less adaptive |
| Least-to-Most / Task Planning | Problems that can be decomposed into ordered sub-problems | Handles structured complexity, but adds planning overhead |
| Hierarchical Planning | Multi-level task decomposition and delegated sub-goals | Scales to complex workflows, but increases coordination overhead |
| Dynamic Replanning | Volatile workflows where conditions change during execution | Adaptive, but can increase compute cost and plan instability |
Reasoning Patterns
The reasoning pattern determines how an agent breaks down and solves a problem. The choice affects accuracy, latency, and cost.
Selection Criteria:
Task Predictability:
Predictable or known workflows are typically suited to Plan-and-Execute patterns, while uncertain or adaptive workflows align with ReAct-style reasoning.
Solution Space:
Tasks with a single expected outcome are generally suited to Chain-of-Thought reasoning, whereas problems with multiple valid paths may benefit from Tree-of-Thoughts.
Quality Requirements:
Standard workflows may use single-pass reasoning, while high-stakes scenarios often use Reflection or multi-pass approaches.
Cost Tolerance:
Each reasoning iteration incurs an additional model call. Iteration limits are typically defined based on acceptable cost and latency constraints.
Additional considerations include latency tolerance, tool interaction complexity, and whether the workflow involves single-agent or multi-agent coordination.
| Pattern | Best For | Tradeoff |
| ReAct | Adaptive, uncertain environments | Flexible, but can loop without iteration bounds |
| Chain-of-Thought | Step-by-step reasoning for logic or calculation | Improves reasoning quality, but increases token usage |
| Tree-of-Thoughts | Problems with multiple viable solution paths | Stronger exploration, but much higher compute cost |
| Reflection/Self-Critique | Quality refinement and validation | Improves output quality, but adds latency and cost |
| CodeAct | Tasks requiring code generation and execution | Enables precise computation and automation, but requires sandboxing and execution safeguards |
Additional reasoning patterns exist, but the patterns above represent the most commonly used approaches in enterprise agent systems. Selection should be guided by task complexity, quality requirements, latency tolerance, and tool-use needs.
Note: ReAct iteration limits: Set max iterations based on cost tolerance. Each iteration costs one full LLM call. For example: for 10 iterations with a frontier model, cost can exceed $0.10 per query. Iteration counts should be monitored in production; consistently high counts indicate the agent is not making progress.
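A bounded ReAct loop can be sketched as below. The `llm_step` callable is a hypothetical stand-in for a model call that returns either an action to take or a final answer; hitting the bound without an answer is the non-progress signal the note describes.

```python
# Sketch of a ReAct loop with a hard iteration cap.
def react_loop(llm_step, max_iterations: int = 5):
    """Run reason-act iterations; stop at the bound and flag non-progress."""
    observations = []
    for i in range(max_iterations):
        thought = llm_step(observations)
        if thought.get("final_answer") is not None:
            return {"answer": thought["final_answer"], "iterations": i + 1}
        observations.append(f"acted: {thought['action']}")
    # Bound reached without an answer: escalate / flag non-progress.
    return {"answer": None, "iterations": max_iterations}

# Toy driver: the fake model answers on its third step.
def fake_llm(obs):
    return {"final_answer": "done"} if len(obs) >= 2 else {"action": "search", "final_answer": None}

result = react_loop(fake_llm)
# result == {"answer": "done", "iterations": 3}
```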
Interaction & Tool Use Patterns
Tool interaction patterns define how agents invoke external capabilities, including function calling, API orchestration, and tool chaining.
3.5 Agent Framework Selection
Agent frameworks (LangChain, LlamaIndex, AutoGen, CrewAI) abstract common patterns: tool calling, memory management, multi-agent coordination.
Direct implementation means custom code calling LLM APIs (OpenAI, Anthropic Claude, etc.) without framework abstraction.
Framework adoption is an important architectural decision. Frameworks simplify development by abstracting common agent patterns and accelerating implementation, but they also introduce an additional layer that can reduce transparency and fine-grained control in complex workflows.
Framework Selection Criteria
| Factor | Framework | Direct Implementation |
| Use Case Coverage | >80% of patterns are standard (RAG, tool-use, chat) | Custom patterns the framework does not support |
| Development Speed | Team is new to Agentic AI, and needs rapid prototyping | Team has LLM API experience, optimizing for performance |
| Latency Requirements | Can tolerate <20% overhead from abstraction layer | Need <100ms end-to-end, every millisecond matters |
| Debugging Complexity | Can accept framework abstraction in stack traces | Need full visibility into every LLM call and token |
| Ecosystem Needs | Need 100+ pre-built integrations (databases, APIs, tools) | Custom integrations only, no ecosystem dependency |
Decision Guidance (Illustrative):
Framework-based approaches are generally effective when most required capabilities are supported and the operational overhead remains within acceptable limits.
Direct implementation may be more suitable when workflows require highly specialized execution patterns, strict latency constraints, or full control over model interactions.
These thresholds should be validated against real workloads rather than treated as fixed rules.
When to Use Frameworks vs Direct Implementation
Use Frameworks When: Building standard RAG pipelines, multi-agent coordination, or tool-heavy workflows where pre-built integrations save significant development time. Frameworks are the de facto choice for most production agent development.
Use Direct Implementation When: Latency is critical (<100ms SLO), custom reasoning patterns are required, or the team needs full control over every LLM call for optimization and debugging. Examples: high-frequency trading agents, real-time decision systems.
Framework Comparison
| Framework | Primary Strength | Best For |
| LangChain | Broadest ecosystem, most integrations | General-purpose agents, tool-heavy workflows |
| LlamaIndex | RAG-optimized data pipeline | Knowledge-intensive agents, document Q&A |
| AutoGen | Multi-agent conversation patterns | Agent-to-agent workflows, debate/consensus |
| CrewAI | Role-based multi-agent teams | Structured team workflows, task delegation |
The frameworks listed above represent commonly used options and are not exhaustive.
Multi-Framework Strategy:
It is valid to combine frameworks for different layers. Example: LlamaIndex for RAG pipeline + LangChain for agent orchestration + direct API calls for latency-critical skills. The key is clear boundary ownership — each layer has one framework responsible.
⚠ Anti-pattern
Choosing a framework based on popularity without benchmarking against actual use cases. Every framework has domains where it excels and domains where it adds friction. Benchmark with real use cases.
4. Models, Tools & Integration
This area covers the components an agent calls: language models (reasoning engine), tools (action layer), and the infrastructure connecting them (LLM Gateway). The central challenge is cost optimization without sacrificing quality — model costs are typically the largest operating expense.
4.1 Model Selection & Cascading
No single model is optimal for all tasks. Small models are fast and economical but limited in complex reasoning. Frontier models are powerful but expensive. Model cascading routes each query to the cheapest model that can handle it reliably — this is one of the highest ROI cost optimizations for agent systems, alongside prompt caching and tool parallelization.
| Tier | Characteristics | Relative Cost | Best For |
| Small LM (7B–13B) | Fast inference (<200ms), limited reasoning | ~100× cheaper than frontier | Classification, extraction, simple formatting |
| Mid-tier LM (70B or equivalent) | Good reasoning, moderate speed | ~10× cheaper than frontier | Standard Q&A, tool selection, summarization |
| Frontier LM (GPT-4 class) | Best reasoning, slower | Baseline cost | Complex multi-step reasoning, ambiguous queries |
| Specialist / Fine-tuned | Domain-specific accuracy | Training cost amortized | High-volume, narrow-domain tasks |
Model routing selects the most appropriate model based on task characteristics. Cascading is a specific routing strategy that escalates queries across model tiers when confidence is insufficient.
Mixture-of-Experts (MoE) Routing
MoE-style routing selects different models based on task type, improving both cost efficiency and performance.
Selection Criteria
| Factor | Consideration |
| Task diversity | Different query types require different capabilities |
| Cost sensitivity | Whether routing reduces use of expensive models |
| Latency tolerance | Whether routing overhead is acceptable |
| Volume | Whether scale justifies multiple model tiers |
| Quality variance | Whether specialized models outperform general models for specific tasks |
MoE routing is most effective when workloads are heterogeneous and routing decisions can be measured and optimized.
Confidence Estimation for Cascading
Cascading relies on identifying when a lower-tier model is insufficient and escalation is required. Confidence signals should be derived from measurable indicators rather than assumed model certainty.
The methods below represent practical approaches to estimating confidence. Each method provides a signal that can be combined or calibrated based on evaluation data.
| Method | How It Works | Accuracy | Latency Cost | Recommendation |
| Token Probabilities | Use model’s logprob output; low max-prob = uncertain | Moderate | Zero extra latency | Good first signal; combine with other methods |
| Consistency Check | Run same query 2× with temperature>0; disagreement = uncertain | High | +1 full LLM call | Use for mid→frontier escalation decisions |
| Separate Classifier | Fine-tuned binary model: ‘can small model handle this?’ | High (tunable) | Small LM latency only | Best for high-volume, well-defined domains |
Confidence Calibration
Confidence signals should be calibrated against evaluation data rather than treated as intrinsic truth.
A practical approach is to measure:
Calibration accuracy = correctly handled queries within a confidence band / total queries in that band.
This allows routing thresholds to be tuned based on acceptable error rates, escalation volume, latency, and cost.
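The calibration measurement above can be sketched as a small helper that buckets evaluation records by confidence band and computes per-band accuracy. The function name and record shape are illustrative, not from any specific library.

```python
def calibration_accuracy(records, bands):
    """Compute per-band calibration accuracy from evaluation data.

    records: iterable of (confidence, was_correct) pairs, one per evaluated query.
    bands:   list of (low, high) confidence intervals, low-inclusive.
    Returns {(low, high): correct / total} for each band, or None for empty bands.
    """
    results = {}
    for low, high in bands:
        in_band = [ok for conf, ok in records if low <= conf < high]
        results[(low, high)] = (sum(in_band) / len(in_band)) if in_band else None
    return results
```

Routing thresholds can then be placed where measured accuracy crosses the acceptable error rate for each tier.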
Illustrative Routing Flow
Query → Small model tier
If confidence is above the upper threshold, return result
If confidence falls within a mid-range, escalate to a stronger model
If confidence is below a lower threshold, escalate to the most capable model or a human review path
Threshold values should be calibrated using evaluation data rather than fixed globally.
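The routing flow above can be sketched as follows. The threshold values (0.9 and 0.6) are placeholders to be replaced with calibrated values, and the model arguments are any callables returning an answer plus a confidence score.

```python
# Illustrative thresholds only; calibrate against evaluation data per domain.
UPPER_THRESHOLD = 0.9
LOWER_THRESHOLD = 0.6

def route(query, small_model, mid_model, frontier_model):
    """Cascade a query across model tiers based on small-model confidence.

    Each model is a callable: query -> (answer, confidence).
    Returns (answer, tier_used) so tier distribution can be monitored.
    """
    answer, confidence = small_model(query)
    if confidence >= UPPER_THRESHOLD:
        return answer, "small"
    if confidence >= LOWER_THRESHOLD:
        # Mid-range confidence: escalate one tier
        answer, _ = mid_model(query)
        return answer, "mid"
    # Low confidence: escalate to the most capable model (or human review)
    answer, _ = frontier_model(query)
    return answer, "frontier"
```

Returning the tier alongside the answer supports measuring the actual tier distribution against the 80/15/5 starting target discussed below.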
Implementation: Cascading can be implemented in the LLM Gateway layer or using frameworks like LiteLLM (supports model fallback routing) or custom logic in the agent orchestration layer.
Note: The 80/15/5 split (80% small, 15% mid, 5% frontier) is a starting target. Measure actual distribution after initial few weeks and tune thresholds. Some domains will be 95/4/1; others 50/30/20.
4.2 Tool Registry & Lifecycle
Tools are the agent’s interface to external systems. A tool registry is the source of truth for what tools exist, how to call them, and how they behave when they fail. Without a registry, tools quickly become undocumented and difficult to monitor.
Tool Discovery & Registry Patterns
Tools may be discovered and managed using different approaches depending on system scale.
| Discovery Pattern | Best for | Tradeoff |
| Static registry | Small, stable tool sets | Simple, manual to maintain |
| Dynamic registry | Frequently changing tool sets | Flexible, adds runtime dependency |
| MCP-based discovery | Multi-agent and interoperable systems | Standardized, requires protocol support |
Decision Guidance (Illustrative)
Static registries are suitable for controlled environments. Dynamic or MCP-based discovery becomes more appropriate as tool count, update frequency, or interoperability requirements increase.
| Registry Element | What It Contains | Why It Matters |
| Identity | Name, version, owner, deprecation date | Prevents calling stale or unmaintained tools |
| Contract | OpenAPI spec, input/output schema | Enables the model to call tools correctly without examples |
| SLA | Timeout (typically 30s), success rate target (99.5%) | Allows the agent to make informed retry/fallback decisions |
| Fallback Chain | Primary → Secondary → Escalate | Defines behavior before failure occurs, not during incident response |
| Cost | Per-call cost estimate | Enables cost-aware routing and budget enforcement |
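The registry elements above can be captured in a typed record. This is a minimal sketch; the field names and defaults are illustrative (the defaults mirror the typical values in the table), not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToolRegistryEntry:
    # Identity: prevents calling stale or unmaintained tools
    name: str
    version: str
    owner: str
    deprecation_date: Optional[str] = None
    # Contract: enables the model to call the tool correctly
    openapi_spec_url: str = ""
    # SLA: informs retry/fallback decisions
    timeout_seconds: int = 30
    success_rate_target: float = 0.995
    # Fallback chain: primary -> secondary -> escalate, defined before failure
    fallback_chain: List[str] = field(default_factory=list)
    # Cost: enables cost-aware routing and budget enforcement
    cost_per_call_usd: float = 0.0

    def is_deprecated(self) -> bool:
        return self.deprecation_date is not None
```

A static registry is then simply a list of these entries serialized to a manifest; dynamic registries serve the same records from a runtime service.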
Tool Discovery by Scale
| Tool Count | Discovery Method | Implementation |
| <20 tools | Static JSON manifest | All tools listed in system prompt; model selects from full set |
| 20–100 tools | Dynamic retrieval | Query embedding → retrieve top-5 relevant tools → model selects |
| >100 tools | Capability-based routing + MCP | Intent → tool category → specific tool. MCP (Model Context Protocol) enables dynamic capability advertisement. |
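For the 20–100 tool tier above, dynamic retrieval can be sketched as embedding-similarity ranking over tool descriptions. The embedding vectors here are placeholders; in practice they would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_tools(query_embedding, tool_embeddings, k=5):
    """Rank registered tools by similarity to the query and return the top k.

    tool_embeddings: {tool_name: description_embedding_vector}.
    Only the returned subset is presented to the model for selection.
    """
    ranked = sorted(tool_embeddings.items(),
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Presenting only the top-k candidates keeps the tool list within the model's effective selection capacity as the registry grows.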
Tool Selection Considerations
Tool selection should consider factors such as the number of tools in the system, update frequency, interoperability requirements, latency constraints, and governance needs such as access control and auditability. These considerations influence whether static registries, dynamic discovery, or MCP-based approaches are appropriate.
Tool Versioning & Breaking Changes
Tool APIs change. When an external API changes its contract, the agent must continue to operate correctly, which requires deliberate versioning and fallback handling rather than in-place modification.
Tool version management strategy:
- Tools are typically versioned explicitly (for example v1, v2) rather than modified in place to ensure backward compatibility and safe upgrades.
- Run shadow traffic: new version receives all requests in parallel (no user impact) for 48–72 hours before cutover. Compare outputs.
- Deprecation path: old version remains callable for 30 days post-cutover. Log all calls to deprecated versions as warnings.
- Fallback to previous version: if new version error rate exceeds 5%, automatically fall back to the previous version.
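The automatic fallback rule above can be sketched as a small router that tracks a rolling error rate for the new version. The class name, window size, and request shape are illustrative; the 5% threshold comes from the strategy above.

```python
from collections import deque

class VersionedToolRouter:
    """Route calls to the new tool version, falling back to the previous
    version when the rolling error rate exceeds the threshold."""

    def __init__(self, new_version, old_version, threshold=0.05, window=100):
        self.new_version = new_version
        self.old_version = old_version
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = error, rolling window

    @property
    def error_rate(self):
        return (sum(self.outcomes) / len(self.outcomes)) if self.outcomes else 0.0

    def call(self, request):
        if self.error_rate > self.threshold:
            # Known-bad new version: route straight to the previous version
            return self.old_version(request)
        try:
            result = self.new_version(request)
            self.outcomes.append(False)
            return result
        except Exception:
            self.outcomes.append(True)
            # Per-call fallback so the current request still succeeds
            return self.old_version(request)
```

In production the same outcome log would also feed the deprecation warnings described above.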
4.3 Retry, Timeout & Circuit Breaker
Every external tool call can fail. The retry strategy determines how the agent recovers — or knows when to stop trying.
| Parameter | Typical configuration | Rationale |
| Max Retries | 3 | Beyond 3, the tool is likely experiencing an outage, not a transient error |
| Base Backoff | 1 second | Exponential: 1s → 2s → 4s. Prevents load spikes on shared services. |
| Jitter | ±25% of backoff | Prevents retry storms when multiple agents hit the same failing tool simultaneously |
| Timeout (per call) | 30 seconds | Long enough for most APIs; short enough to not block workflow indefinitely |
| Circuit Breaker Open | After 3 failures in 60s | Immediately fail fast once a tool is known-broken; prevents wasted retries |
| Circuit Breaker Reset | After 30s half-open probe | Allow one test call to check if the tool has recovered |
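The parameters in the table can be sketched as a retry helper plus a circuit breaker. This is a minimal illustration using the table's typical values as defaults; the `sleep`, `rng`, and `clock` parameters are injection points so the policy can be tested without real waiting, not part of any standard API.

```python
import random
import time

def call_with_retries(tool, max_retries=3, base_backoff=1.0,
                      jitter=0.25, sleep=time.sleep, rng=random.random):
    """Retry with exponential backoff (1s -> 2s -> 4s) and +/-25% jitter."""
    last_error = None
    for attempt in range(max_retries + 1):  # initial call plus retries
        try:
            return tool()
        except Exception as exc:
            last_error = exc
            if attempt == max_retries:
                break
            delay = base_backoff * (2 ** attempt)
            delay *= 1 + jitter * (2 * rng() - 1)  # spread out retry storms
            sleep(delay)
    raise last_error

class CircuitBreaker:
    """Open after `failure_threshold` failures within `window_seconds`;
    allow a half-open probe after `reset_seconds`."""

    def __init__(self, failure_threshold=3, window_seconds=60,
                 reset_seconds=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = []
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a probe call once the reset interval has elapsed
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now

    def record_success(self):
        self.failures.clear()
        self.opened_at = None
```

In practice the breaker state is checked before `call_with_retries` so a known-broken tool fails fast instead of consuming its full retry budget.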
⚠ Anti-pattern
Retrying indefinitely or at fixed intervals. Infinite retries can block workflows under persistent failures, and fixed intervals can trigger retry storms when many agents hit the same heavily shared tool.
4.4 The LLM Gateway
The LLM Gateway is the single point through which all model calls flow. It provides model routing (cascading), rate limiting, prompt caching, cost metering, and observability. Without it, there is no visibility into model usage and no ability to optimize costs.
Request Flow: User Request → API Gateway (auth, rate limiting) → LLM Gateway (model routing, caching, metering) → LLM Provider APIs (for example, GPT-class or Claude-class models)
| Gateway Option | Best For | Key Features | Cost |
| LiteLLM | Teams starting out, open-source preference | 100+ provider support, unified API, model routing | Free (open source) |
| Portkey | Growing teams (10K–100K req/day) | Advanced routing, prompt caching, A/B testing, observability dashboards | Per-request pricing |
| Custom-built | High-volume (>100K req/day) with unique routing logic | Full control, custom algorithms, internal tooling integration | Engineering investment |
Cost Attribution: The LLM Gateway should tag every request with originating agent, skill invoked, workflow ID, and model tier used. This is the foundation for cost-per-workflow analysis — without it, expensive workflows cannot be identified and optimized.
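The tagging described above can be sketched as a thin wrapper in the gateway layer. The function and field names are illustrative; `provider_call` stands in for any model-provider client returning a response and a cost.

```python
def meter_llm_call(provider_call, *, agent, skill, workflow_id, model_tier, ledger):
    """Wrap a model call so every request is tagged for cost attribution.

    provider_call: callable prompt -> (response_text, cost_usd).
    ledger: any append-able sink (in practice, a metrics/billing pipeline).
    """
    def tagged_call(prompt):
        response, cost = provider_call(prompt)
        ledger.append({
            "agent": agent,
            "skill": skill,
            "workflow_id": workflow_id,
            "model_tier": model_tier,
            "cost_usd": cost,
        })
        return response
    return tagged_call
```

Aggregating the ledger by `workflow_id` is what enables the cost-per-workflow analysis described above.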
Note: Calling model APIs directly from agent code without a gateway is acceptable for prototyping. For production systems, an LLM Gateway is essential for visibility, cost control, and optimization.
Key Takeaways
- Use case complexity is a key factor in determining whether agentic AI and multi-step reasoning are appropriate.
- Skill contracts benefit from structured inputs/outputs, error handling, and fallback paths to support testing and composition.
- RAG pattern selection is typically driven by query complexity, quality requirements, and whether retrieval is necessary.
- Agent frameworks are evaluated based on use case fit (e.g., capability coverage) and operational overhead (e.g., latency), using real workload benchmarks.
- Model cascading is commonly used to improve cost efficiency by routing queries to the most cost-effective capable model.
Lavanya Subbarayalu is an enterprise AI architect specializing in large-scale AI platforms, agentic systems, and responsible AI design. She focuses on building practical, production-ready architectures that enable scalable, reliable, and governed deployment of AI systems across enterprise environments.
