Enterprise Agentic AI Architecture Design Guidance – Part 2

… Protocols, Evaluation, Infrastructure & Security

By Lavanya Subbarayalu, Principal Solution Architect, PwC

Designing Enterprise Agentic AI Systems

Part 2 covers the operational and governance layers of agentic AI systems: how agents communicate and coordinate (Area 5), how system behavior is evaluated (Area 6), how agent systems are deployed and scaled (Area 7), and how they are secured and governed (Area 8). These areas transform a working prototype into a reliable and governable production system.

Each area builds on decisions made in Part 1: for example, the number of skills and the selection of microservices (Areas 1 and 3) directly determine the multi-agent coordination patterns available here. These operational areas extend those architectural principles.

Prerequisite: Please read Part 1 (Areas 1–4) first, as architectural decisions there constrain the operational options described here.

5. Protocols, Planning & Multi-Agent Coordination

Enterprise agent systems operate as distributed workflows where agents exchange messages, collaborate on reasoning tasks, and execute actions across services. Designing these interactions requires clear communication protocols, coordination patterns, and execution strategies.

This section defines how agents communicate, plan, and coordinate work, and how common failure modes such as deadlocks can be avoided.

5.1 Communication Protocol Selection

Communication protocols define how agents exchange information, invoke tools, and interact with other systems.

Protocol Purpose Typical Use Selection Criteria
MCP (Model Context Protocol) Standardized tool access Agent → tool interaction Structured tool contracts required
A2A Agent‑to‑agent communication Multi‑agent coordination Frequent inter‑agent interaction
AP2 Agent payment protocol Transactional workflows Regulated financial interactions
Agent Network Protocol Agent ecosystem discovery Cross‑organization coordination Dynamic ecosystem integration
agents.json Capability advertisement Agent discovery Dynamic capability routing
REST or gRPC APIs Service interface contracts Enterprise Domain Services Integration with existing APIs

Decision Guidance:

Protocols are typically selected based on interaction layers. For example, tool invocation may use MCP, while agent coordination may use A2A messaging.

Multi-Protocol Strategy:

Enterprise agent systems rarely rely on a single protocol. Different interaction layers typically use different protocols: tool interaction may use MCP, agent-to-agent coordination may use A2A, and discovery may use registry-based approaches or agents.json. Multi-protocol architecture should define protocol boundaries explicitly to avoid overlap, inconsistent contracts, and governance gaps.

5.2 Interaction Patterns

Interaction patterns determine how agents exchange information and coordinate tasks across system boundaries.

The choice between synchronous and asynchronous communication is one of the most important architectural decisions for multi-agent systems, affecting latency, fault tolerance, and system complexity.

Interaction Pattern Selection Criteria:

The following latency ranges are example starting points and should be calibrated based on domain requirements, workload characteristics, and user experience expectations.

Pattern Mechanism Latency When to Use Failure Behavior Selection Criteria
Synchronous request-response Direct API call; caller blocks until response Predictable (call time) Tasks <5s, immediate response needed. Short deterministic tasks Caller knows immediately if downstream failed Immediate response required
Asynchronous messaging Message queue (Kafka, SQS); caller continues Variable (queue depth dependent) Long tasks (>30s), workflow pipelines Caller decoupled; failures detected via dead letter queue Unpredictable task duration
Streaming Continuous connection (WebSocket/SSE) Real-time (<100ms chunks) Token-by-token generation, live dashboards Connection drop requires client reconnect logic Real-time interaction
Delegation with callback Task assignment with asynchronous callback when the delegated task completes Sum of subtask times Clear ownership handoff between specialized agents Delegators should track and timeout pending delegations Expertise required
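The delegation-with-callback row above can be sketched with asyncio: the delegator awaits each delegated task under a hard timeout instead of blocking indefinitely. This is a minimal illustration; the function names and durations are assumptions, not a prescribed implementation.

```python
import asyncio

# Hypothetical worker: simulates a delegated subtask of a given duration.
async def worker_task(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{name}: done"

# Delegation with a hard timeout: the delegator is always notified,
# either with a result or with a timeout it can route onward.
async def delegate(name: str, seconds: float, timeout: float) -> str:
    try:
        return await asyncio.wait_for(worker_task(name, seconds), timeout)
    except asyncio.TimeoutError:
        # In production this task would go to a dead letter queue.
        return f"{name}: timed out after {timeout}s"

async def main() -> list:
    # Fan out two delegations concurrently; one completes, one times out.
    return list(await asyncio.gather(
        delegate("fast-subtask", 0.01, timeout=1.0),
        delegate("slow-subtask", 5.0, timeout=0.05),
    ))
```

Running `asyncio.run(main())` returns both outcomes; the slow subtask never blocks the fast one, which is the point of the asynchronous pattern.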

Decision Guidance:

Asynchronous messaging is typically used for workflows that exceed interactive latency thresholds or have unpredictable execution times.

Anti-pattern

Using synchronous calls for workflows with unpredictable or long execution times.

Synchronous calls block the caller and hold connection and memory resources for the duration of the request. Under high concurrency, this can lead to connection pool exhaustion, thread starvation, and cascading failures.

5.3 Message Schema & Event Design

In event-driven agent architectures, the message schema is the contract between agents. Poorly designed message schemas are a frequent cause of difficult-to-diagnose failures in distributed agent workflows.

Required Fields in Every Agent Message:

Field Type Purpose
correlation_id UUID Traces a request across the entire agent chain. Every downstream agent inherits this ID.
agent_id String Identifies which agent produced this message. Critical for debugging and audit.
Timestamp ISO 8601 When the message was produced. Essential for ordering and drift detection.
schema_version Semver Which message schema version this conforms to. Enables backward-compatible evolution.
idempotency_key UUID Prevents duplicate processing. If a message is retried, the receiver can detect and skip duplicates.
TTL Seconds Time-to-live. Messages older than TTL are moved to dead letter queue, not processed.
trace_id UUID Enables distributed tracing across services

Note: Agent messages should be immutable once emitted. Downstream services should append metadata rather than modifying existing fields. Use schema_version to let receivers handle multiple versions gracefully. This prevents subtle debugging issues in distributed agent workflows.
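A minimal sketch of this message envelope in Python, assuming a dataclass-based implementation: the field names follow the table, while the class name, default TTL, and the `child()` helper are illustrative.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: messages are immutable once emitted
class AgentMessage:
    agent_id: str
    payload: dict
    schema_version: str = "1.0.0"
    ttl_seconds: int = 300
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def child(self, agent_id: str, payload: dict) -> "AgentMessage":
        # Downstream agents inherit correlation_id and trace_id, but get
        # a fresh idempotency_key and timestamp of their own.
        return AgentMessage(
            agent_id=agent_id,
            payload=payload,
            correlation_id=self.correlation_id,
            trace_id=self.trace_id,
        )
```

Because every hop created via `child()` shares the original `correlation_id`, the full request chain can be reassembled from logs, which is what makes end-to-end tracing possible.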

Semantic Grounding & Ontology

Multi-agent systems require shared semantics. Terms such as priority, urgency, approval state, and task type should be defined consistently across agents through shared schemas, vocabularies, and ontology registries. Without this grounding, agents may interpret identical messages differently.

5.4 Agent Discovery

In static systems, agents are hard coded to call specific endpoints. In dynamic systems (microservices, auto-scaling), agents should discover each other at runtime.

The following ranges illustrate typical patterns rather than strict thresholds.

Scale Discovery Method How It Works Selection criteria
<10 agents Static configuration Agent endpoints in config file or environment variables. Simple, no runtime dependency. Small deployments
10–50 agents Service registry Agents register on startup; callers query registry by name. Standard service mesh pattern (Consul, etc.). Moderate scale
>50 agents Capability-based routing Agents advertise capabilities (not names). Caller describes what it needs; router matches the best available agent. Enables dynamic composition. Dynamic environments
Dynamic ecosystems Role-based routing Agents selected using skill matching, workload balancing, or specialization Heterogeneous agent pools

 Decision Guidance:

Static discovery approaches become difficult to maintain as agent counts increase beyond small-scale deployments (for example, ~10–20 agents).

Anti-pattern

Hard-coding agent endpoints in large systems. This prevents horizontal scaling and makes deployment brittle.

Dynamic Role Allocation

In large multi-agent systems, agent roles may be assigned dynamically based on skill matching, workload balancing, specialization, or bidding mechanisms, depending on system complexity and resource constraints.

5.5 Confidence Scoring & Escalation

Not every agent decision deserves the same level of trust. A confidence-based escalation framework lets low-risk decisions flow automatically while routing uncertain decisions to human review. This mechanism helps maintain human oversight in production systems.

The thresholds below are illustrative starting points and should be calibrated based on domain risk tolerance, evaluation results, and operational requirements.

Confidence Range Action Logging Review
>0.9 Execute automatically Standard log None required
0.7–0.9 Execute, flag for async review Detailed log with reasoning Reviewed within 24 hours
0.5–0.7 Escalate to supervisor agent Full trace logged Supervisor may re-route or escalate to human
<0.5 Escalate to human operator Full trace + alert triggered Human reviews before actions are executed
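The escalation table can be expressed as a simple routing function. The thresholds below are the illustrative starting points from the table and should be recalibrated per domain.

```python
# Maps a confidence score to the escalation action from the table above.
# Thresholds are illustrative starting points, not production values.
def escalation_action(confidence: float) -> str:
    if confidence > 0.9:
        return "execute"               # standard log, no review
    if confidence >= 0.7:
        return "execute_flag_review"   # detailed log, async review within 24h
    if confidence >= 0.5:
        return "escalate_supervisor"   # full trace, supervisor may re-route
    return "escalate_human"            # full trace + alert, human pre-approves
```

In practice the four return values would map to distinct execution paths (direct tool call, flagged execution, supervisor queue, human review queue).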

Calibration:

Confidence thresholds should be calibrated using historical evaluation data and domain risk tolerance. For example, a high confidence score in a low-stakes recommendation system is acceptable to act on, whereas the same score in a financial transaction system should trigger human review. Thresholds should be validated against the gold-standard evaluation dataset.

Confidence thresholds may also vary by phase: planning, execution, and high-risk actions often require different thresholds.

Anti-pattern

Treating confidence scores as absolute truth. Confidence values are model-dependent and should be calibrated against evaluation datasets.

5.6 Multi-Agent Coordination Patterns

When multiple agents collaborate on a shared goal, the coordination pattern determines how work is divided, how conflicts are resolved, and how the system fails.

Pattern Structure Best For Failure Risk
Supervisor One orchestrator assigns tasks to worker agents <20 agents, predictable workflows Supervisor is single point of failure — should have fallback
Peer-to-Peer Agents communicate directly, no central coordinator Resilience-critical systems Complex debugging; no single source of truth
Hierarchical Multi-level supervisors: top → mid → workers >50 agents, large enterprise systems Deep hierarchy adds latency at each level
Pipeline Linear chain: Agent A → Agent B → Agent C Sequential workflows with clear handoffs Single failure breaks the chain — needs checkpoint/resume
Parallel Fan-out Orchestrator splits work, agents execute in parallel, results merged Independent subtasks (e.g., multi-source research) Merge logic should handle partial failures and partial results
Delegation Manager assigns tasks to specialist agents Clear accountability and specialist routing Manager bottleneck
Broadcast One event is published to multiple subscribers One-to-many notification Ordering and response coordination
Sync Mesh Direct 1:1 handoff with full context transfer Precise specialist handoff Blocking and single-failure risk
Swarm Agents dynamically self-organize by capability Large dynamic systems without fixed supervisor Discovery overhead
Group Chat Agents collaborate in a shared conversational channel Transparent collaborative reasoning Token-heavy and slower
Negotiate Agents iteratively propose and counter-propose Resource contention and competing goals Deadlock or long convergence
Human-in-the-Loop Humans participate in coordination path Safety-critical or regulated workflows Human latency

 Decision Guidance:

The simplest coordination pattern is typically sufficient for most workflows. Advanced coordination patterns such as swarm, negotiation, or group chat are typically introduced when scale or coordination complexity exceeds what simpler orchestration models can handle. Complex coordination structures increase failure modes and debugging complexity.

5.7 Planning & Reasoning Coordination

Planning and reasoning coordination defines how tasks are structured and how multiple agents contribute to decision-making during runtime execution.

Planning coordination patterns selection criteria:

Pattern Best For Tradeoff Selection Criteria
Distributed Planning Independent tasks with minimal dependencies Hidden dependencies may be missed Independent tasks with minimal dependencies
Hierarchical Delegation Planning Complex workflows with clear hierarchy Coordinator bottleneck Centralized visibility and task ownership required
Parallel Consensus Planning Multiple candidate plans need comparison Higher latency and token cost Decision quality is prioritized over cost
Collaborative Planning Shared constraints and competing priorities Slower convergence Balanced outcomes required across stakeholders

 

Decision Guidance:

Hierarchical delegation planning is commonly used for enterprise workflows due to its structured control and visibility. Distributed Planning can be used when task boundaries are proven to be independent.

Planning pattern selection should also consider coordination overhead. Distributed approaches minimize coordination cost but may miss dependencies. Hierarchical approaches provide control but introduce orchestration overhead. Consensus and collaborative approaches increase decision quality but incur higher latency and token cost.

Reasoning coordination patterns selection criteria: 

Pattern Best For Tradeoff Selection Criteria
Sequential Handoff Reasoning Staged workflows where later reasoning depends on earlier conclusions Slower end-to-end execution Clear reasoning chain across specialist roles
Parallel Consensus Reasoning High-stakes decisions requiring independent validation 2–3x reasoning cost Correctness is more important than cost
Debate Reasoning Ambiguous or conflicting interpretations Slowest and most token-intensive No clear protocol match or single authority

Decision Guidance:

Sequential handoff reasoning is typically used as a default coordination approach for production workflows. Parallel consensus or debate reasoning is more commonly applied in high-risk or ambiguous scenarios where additional validation is required.

5.8 Decisioning & Execution Coordination Patterns

Pattern Dependency Structure Best For Tradeoff Selection Criteria
Sequential Execution Linear workflow Dependent tasks and strict order Predictable, slower Strict ordering is required
Parallel Execution Independent tasks Faster, more synchronization complexity Higher coordination overhead Tasks are independent
Conditional Execution Workflows with state-based branching Flexible decision paths High validation effort Runtime state determines path
Iterative Execution Refinement and repeated verification Higher Quality outputs Risk of reasoning loops Quality improvement scenarios where refinement is required

Plan-Verify Pattern:

In the plan-verify pattern, an agent proposes a plan, and a verifier validates it against constraints such as completeness, dependency order, resource availability, and safety. High-risk workflows should validate plans before execution rather than after failure.
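A minimal rule-based verifier for the plan-verify pattern might check completeness and dependency order before any step executes. The plan shape (`id`, `depends_on`) is an assumption of this sketch, not a prescribed schema.

```python
# Rule-based plan verification sketch: every step must have an id, and
# every dependency must be satisfied by an earlier step in the plan.
def verify_plan(steps: list) -> list:
    errors = []
    seen = set()
    for i, step in enumerate(steps):
        if "id" not in step:
            errors.append(f"step {i}: missing id")
            continue
        for dep in step.get("depends_on", []):
            if dep not in seen:
                errors.append(
                    f"step {step['id']}: dependency {dep!r} not satisfied yet"
                )
        seen.add(step["id"])
    return errors

plan = [
    {"id": "fetch", "depends_on": []},
    {"id": "summarize", "depends_on": ["fetch"]},
    {"id": "notify", "depends_on": ["approve"]},  # out-of-order dependency
]
issues = verify_plan(plan)  # flags the unmet "approve" dependency
```

Additional rules (resource availability, safety constraints) would follow the same shape: each rule appends findings, and an empty error list clears the plan for execution.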

Verification Approach When to use Tradeoff Selection Criteria
Rule-Based Verification Deterministic policy checks Fastest, but limited to known rules Known constraints
Simulation Verification Feasibility checks with state transitions More accurate, slower Complex workflows
LLM-as-Judge Verification Context-dependent and ambiguous evaluations / reasoning Flexible, but costs tokens Contextual validation
Hybrid Verification High-stakes workflows Strongest coverage, highest cost Maximum Reliability

Decision Guidance:

Rule-based verification is typically applied for deterministic constraints, simulation for workflow feasibility, and LLM-as-judge for contextual validation. Hybrid verification is commonly used in high-risk workflows where multiple validation layers are required.

5.9 Deadlock Detection & Prevention

Deadlocks are a common failure mode in distributed systems and can emerge in multi-agent coordination workflows. Agent A waits for Agent B, which waits for Agent C, which waits for Agent A. In practice, deadlocks in agent systems usually arise from circular dependencies in task delegation.

Prevention Strategies

  • Enforce delegation depth limits: Delegation depth should be limited (typically 2–4 levels depending on workflow complexity). Beyond the limit, delegation should be escalated to humans. This establishes an upper limit for the cycle length.
  • Timeout every delegation: every delegated task has a hard timeout (default 60 seconds). When timeout fires, the task is cancelled and the delegator is notified. No task waits forever.
  • Track the delegation graph in real time: maintain a directed graph of active delegations. Before delegating, check if the target agent (or any of its dependents) already depends on the delegating agent. If yes, reject the delegation.
  • Dead letter queue for timed-out tasks: timed-out tasks go to a dead letter queue for human review. This is the escape hatch — no deadlock persists beyond the timeout window.
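The delegation-graph strategy above can be sketched as a cycle check run before each delegation is recorded. The class and method names are illustrative, and a production version would add the timeout and depth-limit logic as well.

```python
from collections import defaultdict

class DelegationGraph:
    """Directed graph of active delegations; an edge A -> B means A waits on B."""

    def __init__(self) -> None:
        self.edges = defaultdict(set)  # delegator -> set of delegates

    def _reaches(self, start: str, goal: str) -> bool:
        # Depth-first search: does `start` transitively wait on `goal`?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def delegate(self, src: str, dst: str) -> bool:
        """Record the delegation, or refuse it if it would close a cycle."""
        if self._reaches(dst, src):
            return False  # dst already depends on src: deadlock averted
        self.edges[src].add(dst)
        return True

g = DelegationGraph()
g.delegate("A", "B")
g.delegate("B", "C")
g.delegate("C", "A")  # refused: would create A -> B -> C -> A
```

The check runs before the edge is added, so a circular wait is rejected at delegation time rather than detected after agents are already blocked.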

Anti-pattern

Assuming deadlocks won’t happen because the agents are ‘well-designed.’ Deadlocks are emergent properties of concurrent systems. Design for detection, not just prevention.

5.10 Context Handoff Management

When workflows move between agents, handoffs should preserve reasoning history, findings, metadata, and provenance. Context loss during handoff is a common coordination failure mode, especially in multi-stage workflows.

5.11 Consensus Mechanisms

When multiple agents should agree on a decision (e.g., risk assessment, content classification), a consensus mechanism defines how agreement is reached.

Mechanism Agreement Threshold Speed When to Use
Majority Vote >50% agree Fast (~500ms) Low-stakes fast decisions, informational queries
Weighted Consensus Weighted by expertise/confidence score Fast (~500ms) When agents have different specialization levels. Expert decisions, quality focus
Quorum (K of N) K out of N agents agree (for example 2 of 3 agents) Medium (~1s) Production standard for most decisions
Unanimous 100% agreement required Slow (bounded by slowest agent) Safety-critical decisions only — e.g., medical, financial
Debate Moderator selects strongest argument Slow Complex diagnosis or ambiguous reasoning tasks
Byzantine Fault Tolerance (BFT) Tolerates malicious or faulty nodes Slow Multi-vendor or adversarial environments

BFT is typically used in multi-vendor or untrusted environments where agents may behave inconsistently or maliciously, and fault tolerance is required.

Conflict Resolution (when consensus fails):

  • Confidence-based: if one agent’s confidence exceeds another’s by more than 0.15, it wins automatically. This avoids unnecessary escalation when one agent is clearly more certain.
  • Priority-based: pre-defined priority tiers for specific domains (e.g., compliance agent overrides general agents on regulatory questions).
  • Human escalation: after 3 automated rounds without consensus, escalate to human. Do not loop indefinitely.
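The three-step resolution ladder can be sketched as follows. The 0.15 margin, priority tiers, and three-round limit come from the bullets above; the agent names and tier values are hypothetical.

```python
# Hypothetical priority tiers: compliance overrides general agents.
PRIORITY = {"compliance-agent": 2, "general-agent": 1}

def resolve(a: dict, b: dict, rounds: int) -> str:
    """Return the winning agent's name, 'retry', or 'human'."""
    # 1. Confidence-based: a clear confidence margin wins outright.
    if abs(a["confidence"] - b["confidence"]) > 0.15:
        return max(a, b, key=lambda x: x["confidence"])["agent"]
    # 2. Priority-based: domain priority tiers break near-ties.
    pa, pb = PRIORITY.get(a["agent"], 0), PRIORITY.get(b["agent"], 0)
    if pa != pb:
        return a["agent"] if pa > pb else b["agent"]
    # 3. Human escalation after 3 automated rounds without consensus.
    return "human" if rounds >= 3 else "retry"
```

Each call represents one automated round; callers increment `rounds` until an agent wins or the human escalation path fires, so the loop is always bounded.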

5.12 Advanced Coordination Mechanisms

Large-scale systems may incorporate additional coordination capabilities.

Adaptive Decisioning

Systems may dynamically adjust coordination topology, protocol selection, or routing thresholds based on runtime performance signals.

Reputation & Trust Scoring

Agent success rate, latency, and calibration accuracy can influence routing decisions in large multi-agent ecosystems.

Cost-Aware Orchestration

Routing policies may consider model cost, token usage, and latency budgets to optimize resource utilization.

6. Evaluation & Feedback Loops

Reliable agentic systems require robust evaluation. This section describes how agent workflows are validated before deployment, monitored for degradation in production, and improved through continuous feedback.

The central challenge differs from traditional model evaluation. Instead of assessing a single prediction, agentic systems require evaluation of an entire workflow — a sequence of decisions where errors can accumulate across steps.

6.1 Building the Evaluation Dataset

Evaluation pipelines require ground truth data to validate system behavior. For agent systems, the initial challenge is dataset bootstrapping, where production data does not yet exist, yet evaluation requires representative tasks.

A common approach is to bootstrap an initial evaluation dataset before deployment and refine it as production traffic becomes available.

Phase Dataset Size (illustrative) Source Method
Bootstrap 50–100 queries Manual creation by domain experts Write queries that cover happy path, edge cases, failure modes, adversarial inputs
Expansion 200–500 queries Real user queries (anonymized) + synthetic Capture first 200 real queries: label with domain expert. Add synthetic edge cases.
Scale 1000+ queries Production replay (anonymized) + synthetic generation 70% real (anonymized), 20% synthetic edge cases, 10% gold-standard expert-labeled

 

Common evaluation dataset distribution:

In steady state, the evaluation dataset is approximately 70% anonymized production queries (daily refresh), 20% synthetic edge cases (quarterly refresh), and 10% gold-standard expert-labeled queries (monthly review).

The gold-standard set is the anchor — it is the set that detects regressions.  Evaluation datasets should evolve alongside production traffic to detect real-world drift.

6.2 Evaluating Multi-Step Workflows

Evaluating single-turn responses is relatively simple: did the model provide the correct answer? Multi-step workflows are harder. An error in an early step (say step 2) may not become apparent until a later stage (say step 5), so evaluation must assess both the overall outcome and the accuracy of each intermediate step.

 

Evaluation Level What To Measure Method Example
Component Each skill in isolation Unit test with mocked inputs verify_identity returns correct result for known test cases
Step Accuracy Each step in the workflow Trace-based: compare each step’s output to expected Step 2 (route to specialist) correctly routes 95% of cases
End-to-End Final outcome vs ground truth Full workflow execution with real or simulated inputs Customer issue resolved in ≤3 turns
Path Validity Did the agent take a VALID path? Compare taken path to set of acceptable paths Any path that reaches resolution is valid, even if non-standard

Note:

Path validity is one of the most valuable evaluation signals in agent systems. An agent may reach a correct outcome through a different but valid reasoning path. Evaluation should therefore define acceptable paths rather than enforcing a single expected path.
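A path-validity check can be as simple as membership in a set of acceptable paths. The step names here are hypothetical.

```python
# Acceptable reasoning paths for one task; the second path is
# non-standard but still valid, per the note above.
ACCEPTABLE_PATHS = {
    ("classify", "retrieve", "resolve"),
    ("classify", "escalate", "resolve"),
}

def path_is_valid(taken: list) -> bool:
    """True if the agent's taken path is any acceptable path, not just the canonical one."""
    return tuple(taken) in ACCEPTABLE_PATHS
```

More permissive variants (prefix matching, required-step checks) follow the same idea: define the set of acceptable behaviors rather than one expected trace.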

Anti-pattern

Evaluating only final outputs while ignoring intermediate reasoning steps.

6.3 Regression Detection Between Versions

Ensuring that deploying a new model version or updating a skill does not negatively impact performance requires careful evaluation and monitoring.

Statistical significance testing is necessary, since a 1% decrease in accuracy on a test set of 100 queries could simply be random variation, while the same drop on a test set of 10,000 queries indicates a meaningful change.

Metric Regression Signal Statistical Test Action
Task Completion Rate Drop >2% absolute Chi-squared test (p<0.05) Block deployment; investigate
Response Latency (p95) Increase >20% Mann-Whitney U test Flag; proceed if latency is still within SLO
Hallucination Rate Increase >0.5% absolute Fisher’s exact test (small counts) Block deployment immediately
Cost per Query Increase >30% Simple comparison vs baseline Flag; may indicate model tier drift in cascading
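The 100-versus-10,000 query example can be checked with a two-proportion z-test. The table suggests a chi-squared test; for two proportions the z-test is equivalent, and this sketch uses only the standard library.

```python
import math

def regression_p_value(success_old: int, n_old: int,
                       success_new: int, n_new: int) -> float:
    """Two-sided p-value for a difference in task-completion proportions."""
    p1, p2 = success_old / n_old, success_new / n_new
    pooled = (success_old + success_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# A 1% accuracy drop on 100 queries vs the same drop on 10,000 queries:
small = regression_p_value(95, 100, 94, 100)        # not significant
large = regression_p_value(9500, 10000, 9400, 10000)  # significant
```

The same 1-point drop yields p > 0.05 on the small set (plausibly noise) but p < 0.05 on the large set (a meaningful regression), which is exactly why the sample size must inform the block/flag decision.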

Shadow Evaluation:

Shadow evaluation can be used when introducing a new version. In this approach, the new system receives production traffic, but its responses are not exposed to users.

Shadow outputs are compared with the live system on the same inputs to assess performance across key metrics. Deployment decisions of new versions are typically based on whether the shadow system demonstrates comparable or improved performance.

6.4 Adversarial Testing

Adversarial testing probes the system for vulnerabilities that normal testing misses. For agent systems, the attack surface is larger than for standard applications — prompts are user-controlled input, and agents can take actions (not just return text).

Attack Type Description Vectors Frequency Target
Prompt Injection Attacker embeds instructions in input to override agent behavior Direct injection in query, indirect via retrieved documents, multi-turn escalation Weekly 0% success rate
Jailbreak Attacker bypasses safety guardrails via roleplay, hypotheticals, or encoding DAN prompts, fictional scenarios, base64-encoded instructions Monthly 0% success rate
Tool Abuse Attackers manipulate agents into calling tools with malicious parameters Crafted inputs that cause SQL injection, file access, or privilege escalation via tool parameters Monthly 0% success rate
Red Team Structured adversarial testing by dedicated team 1000+ automated scenarios + human creative attacks Quarterly >95% detection rate
Multi-Agent Attack Compromised agent impersonates another or intercepts messages Agent impersonation, man-in-the-middle on agent communication, privilege escalation via delegation Monthly 100% detection rate
Data Exfiltration Agent tricked into revealing confidential data Indirect retrieval prompts, chain attacks Monthly 0% success

Anti-pattern

Only testing with ‘friendly’ inputs. Adversarial testing should be a formal, scheduled process with defined attack vectors. Ad-hoc testing catches obvious issues; systematic adversarial testing catches the subtle ones that cause production incidents.

6.5 Feedback & Continuous Improvement

Evaluation without feedback limits the ability to improve system behavior. Feedback loops connect evaluation results to system updates, enabling alignment between observed performance and intended outcomes.

Feedback Pattern How It Works When to Use Improvement Cycle
Human-in-the-Loop (HITL) Oversight Human monitors agent actions; intervenes on anomalies High-stakes domains (financial, medical) Continuous — interventions inform model updates
HITL Approval Agent proposes action → human reviews → approves/rejects → agent executes Actions with significant consequences Approval data becomes training signal
Implicit Feedback Track whether the user’s problem was resolved (follow-up queries, satisfaction signal) All production agents Weekly rollup informs prompt and skill tuning
Active Learning Flag low-confidence queries (say <0.7) for human labeling; use labels to fine-tune or update prompts High-volume domains where labeling is feasible Batch update monthly; 50–80% reduction in labeling cost over time

7. Infrastructure, Deployment & Performance

This area covers the operational foundation: where models run, how agent systems are deployed, how performance is maintained at scale, and how costs are managed. The key insight for agent systems: infrastructure decisions are tightly coupled to model decisions (Area 4). Model cascading, tool parallelism, and RAG caching all have infrastructure implications.

7.1 Where Models Run

The decision of where to run models — cloud API, hybrid, or on-premises — is driven by volume, latency requirements, data sovereignty, and cost at scale.

The volume ranges below are illustrative reference points and should be calibrated based on workload characteristics, latency requirements, and cost constraints.

Volume Burst Pattern Recommended Deployment Best For
<500K queries/day Unpredictable (10× spikes) Managed Cloud APIs Elastic scaling; no GPU management; pay per token
500K–2M/day Sustained with <3× peaks Hybrid: on-prem for baseline + cloud for burst Predictable baseline on-prem; cloud absorbs burst traffic
>2M/day Predictable, sustained On-premises (GPU cluster) Lower cost per token at sustained volume; full control over infrastructure
Data sovereignty required Any On-premises or private cloud Data should remain within network boundaries due to regulatory constraints

7.2 Serving Infrastructure

For on-premises or hybrid deployments, the model serving layer is a critical engineering decision. It affects throughput, latency, and GPU utilization.

Serving Framework Best For Key Feature Consideration
vLLM High throughput serving of open models PagedAttention — near-optimal GPU memory utilization Good choice for production open model serving
TGI (Text Generation Inference) Hugging Face model ecosystem Optimized for transformer architectures Good defaults; less tuning flexibility than vLLM
Ollama Developer machines, small-scale local inference Simple setup, model management Not designed for production-scale throughput
Managed Cloud Endpoints (Azure AI, AWS Bedrock, OpenAI, Anthropic, Vertex AI ) Managed model serving Fully managed scaling and availability Limited control over infra and tuning

7.3 GPU Memory Management for Multi-Model and Cascading Systems

Model cascading means multiple models should be available simultaneously. Loading all models simultaneously into GPU memory is often inefficient for multi-model systems.

The following strategies manage this:

Strategy How It Works When to Use Trade-off
Model Parallelism Split one large model across multiple GPUs Single frontier model >80GB Uses all GPUs efficiently; adds inter-GPU communication latency
GPU Time-Sharing (MPS) Multiple models share GPU via NVIDIA MPS 2–3 small/mid models on one GPU Simple; models compete for compute — latency unpredictable under load
Dedicated Pools Small models on GPU Pool A; frontier models on Pool B Clear tier separation, predictable latency Requires more GPUs but gives SLO-predictable performance
Offload to CPU Keep model weights in CPU memory; load to GPU on demand Infrequently used specialist models Cold-start latency on first call (seconds); subsequent calls are warm

These strategies are commonly used in multi-model and cascading architectures where multiple model tiers should be available concurrently.

Note:

For cascading architectures, dedicated pools are a common approach. Small/mid-tier models run on a shared pool with auto-scaling; frontier models run on a dedicated, always-warm pool, which provides a predictable latency for each tier.

7.4 Cold Start & Warm-Up

LLM serving may have cold-start latency: the first request after a model is loaded (or a container starts) may take significantly longer than subsequent requests due to model weight loading and JIT compilation. This directly impacts the p95/p99 latency SLOs.

Approach How It Works Latency Impact Cost
Always-warm pool Keep at least 1 instance of each model tier always running Eliminates cold start Baseline cost even at zero traffic
Pre-warm on scale-out When auto-scaler adds a pod, it pre-warms the model before accepting traffic Hides cold start behind scale-out delay Slight over-provisioning
Health-check warm-up Container health check includes a model inference call; pod is not marked ready until warm Traffic never hits a cold instance Slower pod readiness (~10–30s)

 

SLO Planning:

Typically, p95 SLO definition should account for cold-start latency. For example, if the warm inference p95 is 800ms but cold start adds 3s, then p95 SLO should reflect the expected mix of cold and warm calls based on the traffic pattern.
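A back-of-envelope check of this mix, under the simplifying assumption that every cold call is slower than the warm p95: if cold calls exceed 5% of traffic, the 95th-percentile call is itself a cold call. The numbers below reuse the 800ms/3s example above.

```python
def p95_estimate(warm_p95_ms: float, cold_ms: float, cold_fraction: float) -> float:
    """Rough p95 under a cold/warm traffic mix.

    Assumes every cold call is slower than the warm p95 (a simplification);
    then the p95 flips to the cold latency once cold calls exceed 5% of traffic.
    """
    return cold_ms if cold_fraction > 0.05 else warm_p95_ms

p95_estimate(800, 3800, cold_fraction=0.02)  # SLO still governed by warm calls
p95_estimate(800, 3800, cold_fraction=0.10)  # cold starts now dominate the p95
```

This is why warm-pool sizing and pre-warming (above) are SLO controls, not just cost optimizations: keeping the cold fraction low keeps cold starts out of the p95 entirely.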

7.5 Deployment Strategies

Strategy How It Works Best For Rollback
Blue-Green Two identical environments; traffic switches atomically Mission-critical systems (life/safety) Instant (<10s) switch back to blue
Canary Route traffic gradually: 10% → 50% → 100%, with automated checks at each stage Standard production deployments Automatic rollback if SLO gates fail at any stage
Shadow New version receives 100% of traffic in parallel but responses are not shown to users High-stakes validation before any user exposure No rollback needed — shadow is read-only

7.6 CI/CD Pipeline — 4 Gates

Every deployment should pass four automated gates. Any gate failure blocks the deployment.

The values below are illustrative reference points and should be calibrated using evaluation data, deployment risk, and acceptable error thresholds.


Gate What It Checks Pass Criteria Failure Action
Gate 1: Evaluation Model quality on gold-standard dataset Task completion >95%, hallucination rate <2%, error rate <1% Block deployment; alert ML team
Gate 2: Security Adversarial and compliance checks 0 critical findings, 0 PII leaks, 0 prompt injection successes Block deployment; alert security team
Gate 3: Performance Latency, cost, throughput vs baseline p95 latency <2× baseline, cost <1.5× baseline, throughput >0.8× baseline Block if SLO breached; flag if within 20% of threshold
Gate 4: Synthetic Load Simulation Full workflow simulation (realistic tool latency, external responses) under synthetic load Workflow completion >95%, handoff loss <1%, 0 deadlocks detected Block deployment; investigate coordination issues
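The four gates above reduce to a first-failure check. A minimal sketch, with metric names invented here and thresholds taken from the illustrative table (calibrate per system):

```python
# Each gate maps to a predicate over a metrics dict; any failure blocks deployment.
GATES = {
    "evaluation": lambda m: (m["task_completion"] > 0.95
                             and m["hallucination_rate"] < 0.02
                             and m["error_rate"] < 0.01),
    "security": lambda m: (m["critical_findings"] == 0
                           and m["pii_leaks"] == 0
                           and m["injection_successes"] == 0),
    "performance": lambda m: (m["p95_latency"] < 2 * m["baseline_p95"]
                              and m["cost"] < 1.5 * m["baseline_cost"]),
    "load_sim": lambda m: (m["workflow_completion"] > 0.95
                           and m["deadlocks"] == 0),
}

def check_gates(metrics):
    """Return the first failing gate name, or None if all four gates pass."""
    for name, passes in GATES.items():
        if not passes(metrics):
            return name
    return None
```

The returned gate name drives the failure action (alert ML team, alert security team, and so on).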

7.7 Automated Rollback Triggers

Rollback should be automatic for critical metrics. Defining trigger thresholds in advance prevents delayed detection during incidents.

The triggers below are illustrative reference points and should be calibrated based on system behavior, acceptable risk thresholds, and operational requirements.

Metric Rollback Trigger (illustrative) Detection Window (Indicative) Action
Error Rate >3% (5× baseline) 5 minutes Immediate rollback to previous version
Latency p95 >5s (sustained across at least 3 monitoring intervals) 10 minutes Immediate rollback
Hallucination Rate >5% (2.5× baseline) 15 minutes Rollback + alert + incident ticket
Task Completion <85% (10% drop) 30 minutes Rollback + investigation
Cost per Query >3× baseline (sustained) 1 hour Alert; rollback if not explained by traffic change
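A sketch of how these triggers might be encoded (thresholds are the illustrative values from the table; the "sustained across N intervals" logic is omitted for brevity):

```python
# (metric name, breach predicate over current value and baseline, action) — illustrative.
TRIGGERS = [
    ("error_rate",         lambda v, b: v > 0.03,    "rollback"),
    ("latency_p95_s",      lambda v, b: v > 5.0,     "rollback"),
    ("hallucination_rate", lambda v, b: v > 0.05,    "rollback+incident"),
    ("task_completion",    lambda v, b: v < 0.85,    "rollback+investigate"),
    ("cost_per_query",     lambda v, b: v > 3 * b,   "alert"),
]

def evaluate_triggers(metrics, baselines):
    """Return the (metric, action) pairs that fired in this monitoring window."""
    fired = []
    for name, breached, action in TRIGGERS:
        if breached(metrics[name], baselines.get(name)):
            fired.append((name, action))
    return fired
```

Running this on every monitoring interval keeps rollback decisions automatic and auditable.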

7.8 SLI/SLO Targets

Service Level Indicators (SLIs) are the metrics you measure; Service Level Objectives (SLOs) are the targets those metrics must meet.

The values given below are illustrative reference points and should be calibrated based on workload behavior, system performance targets, and operational constraints.

Metric (SLI) Example Target (SLO) Measurement Window Notes
Task Completion Rate >95% Rolling 24h End-to-end: query received → resolution delivered
Response Latency (p95) <2 seconds Rolling 1h Excludes streaming scenarios
Error Rate <1% Rolling 1h System errors + tool failures + timeouts
Escalation Rate <10% Rolling 24h Queries requiring human intervention
Model Latency (p95) SLM <400ms, LLM <1.8s Rolling 1h Per-model-tier, excludes network
Tool Success Rate >98% Rolling 1h Including retries, excluding circuit-broken tools
Cache Hit Rate >70% Rolling 1h RAG + prompt caching combined
Availability >99.9% Rolling 30 days System uptime (health check based)

Note: These are starting points. A customer-facing chatbot may need p95 <1s; an internal research tool may tolerate p95 <5s. Calibrate against the user experience expectations.
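SLI measurement itself is straightforward once latency samples are collected per window. A minimal nearest-rank p95 check against the example 2-second SLO:

```python
def percentile(samples, p):
    """Nearest-rank percentile over raw samples (no interpolation)."""
    ranked = sorted(samples)
    # ceil(n * p / 100) - 1, via Python's floor-division negation trick
    k = max(0, -(-len(ranked) * p // 100) - 1)
    return ranked[int(k)]

def latency_slo_met(samples_ms, slo_ms=2000, p=95):
    """Check a rolling window of latency samples against the p95 SLO."""
    return percentile(samples_ms, p) < slo_ms
```

Nearest-rank is one common convention; monitoring systems may interpolate, which gives slightly different values on small windows.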

7.9 Cost Optimization — Prioritized by ROI

The prioritization and expected savings below are indicative and should be adjusted based on workload characteristics, cost drivers, and operational constraints.

Optimization Expected Savings (indicative) When to Implement Complexity
Model Cascading 60–70% model cost reduction Often one of the first optimizations implemented. Medium (requires confidence measurement)
Prompt Caching 40–60% on repeated prompts When cache hit rate >50% is achievable Low (provider feature)
RAG Cache 30–50% on retrieval-heavy workflows When >70% cache hit achievable Medium (cache invalidation logic)
Prompt Optimization 20–30% token reduction When prompts exceed 2K tokens Low (iterative refinement)
Tool Parallelization 2–3× throughput (not cost, but $/throughput) When 5+ independent tools per workflow Medium (dependency graph analysis)
Spot Instances 30–60% compute cost on non-critical tasks For batch eval, non-real-time agent tasks Medium (spot interruption handling)
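Model cascading, the first item above, can be sketched as a confidence-gated router (illustrative; the `(answer, confidence)` model interface is an assumption, and real systems need a calibrated confidence signal):

```python
def cascade(query, tiers, threshold=0.8):
    """Route a query through model tiers cheapest-first.

    tiers -- list of (name, model_fn), ordered cheap to expensive,
             where model_fn(query) -> (answer, confidence)
    Escalates to the next tier whenever confidence is below threshold;
    the final (frontier) tier's answer is always accepted.
    """
    for name, model_fn in tiers[:-1]:
        answer, confidence = model_fn(query)
        if confidence >= threshold:
            return name, answer  # cheap tier was confident enough
    name, model_fn = tiers[-1]
    answer, _ = model_fn(query)
    return name, answer
```

Savings come from the fraction of traffic that never reaches the frontier tier, which is why confidence calibration is the hard part of this optimization.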

Anti-pattern

Deploying frontier models for all workloads without routing or cascading, leading to unnecessary costs and latency increases.

8. Security, Governance & Monitoring

Security for agentic AI systems extends beyond traditional application security because agents can execute actions, not just return text. A compromised or manipulated agent can modify databases, send messages, execute code, or call external APIs. This area covers the threat model, the security architecture, and the monitoring required to maintain trust in production.

8.1 Threat Model: MAESTRO 7-Layer Framework

MAESTRO (Model, Agent, Ecosystem, Security, Trust, Risk, Operations) is a layered threat modeling approach designed specifically for agent-based systems. It defines seven layers of threat surface specific to agentic AI systems. Each layer requires distinct controls.

Layer Threat Surface Primary Controls
1. Foundation Models Model manipulation via adversarial prompts 3-layer guardrails (input/processing/output), hallucination detection, output validation
2. Data Operations Data poisoning, PII exposure, unauthorized access PII detection at ingress/egress, 4-tier data classification, encryption at rest and in transit
3. Agent Frameworks Tool abuse, parameter manipulation, sandbox escape Tool-level RBAC, parameter validation against schema, execution sandboxing
4. Infrastructure Network interception, credential theft mTLS between all services, per-agent service accounts, secrets vault (HashiCorp Vault), 7-day cert rotation
5. Evaluation Undetected degradation, drift Confidence-based escalation, continuous drift detection, adversarial testing
6. Compliance Audit gaps, unauthorized autonomous action 3-tier autonomy levels, immutable audit logs, 4-gate approval for changes
7. Ecosystem Supply chain attacks, compromised dependencies Dependency scanning, security trimming of unused capabilities, handoff monitoring

8.2 Zero Trust Architecture for Agents

Traditional perimeter security assumes that entities within the network are trusted. In agent-based systems, where agents interact with external APIs and other agents, this assumption may not hold. Zero-trust principles require that each request is verified, regardless of its origin.

Per-Agent Identity

  • Every agent instance has a unique identity (service account), not a shared credential.
  • All inter-agent and agent-to-service communication uses mTLS (mutual TLS), so both sides prove their identity.
  • Certificates rotate every 7 days. Short-lived credentials limit the blast radius of a compromised agent.

3-Tier Autonomy Levels

Tier Autonomy Level Examples Control (illustration)
T1: Autonomous Agent acts without approval Read operations, low-risk queries, informational responses Confidence should be >0.8
T2: Supervised Agent acts, human reviews afterward Write operations, moderate-impact actions Confidence should be >0.6; flagged for review
T3: Approved Agent proposes, human approves before execution Financial transactions, data deletion, external communications Humans should approve before any action


Autonomy Assignment:

Autonomy tiers are typically defined at the tool or action level rather than at the agent level. This allows a single agent to operate at different autonomy levels depending on the task (for example, read operations at T1 vs. financial transactions at T3), and helps prevent overly broad permissions.
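A per-tool tier policy might look like the following sketch (tool names are hypothetical; the confidence thresholds are the illustrative values from the table above):

```python
# Illustrative tier policy keyed by tool/action, not by agent.
TOOL_TIERS = {
    "search_kb":     "T1",  # read-only: autonomous
    "update_record": "T2",  # write: act, then flag for human review
    "issue_refund":  "T3",  # financial: human approval before execution
}

def dispatch(tool, confidence, approved=False):
    """Decide how a proposed tool call proceeds under the 3-tier policy."""
    tier = TOOL_TIERS.get(tool, "T3")  # unknown tools fail closed to strictest tier
    if tier == "T1":
        return "execute" if confidence > 0.8 else "escalate"
    if tier == "T2":
        return "execute_and_flag" if confidence > 0.6 else "escalate"
    return "execute" if approved else "await_approval"
```

Defaulting unknown tools to T3 is the fail-closed choice that keeps newly registered tools from silently gaining autonomy.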

8.3 Cross-Agent Trust & Lateral Movement Prevention

If Agent A is compromised (via prompt injection or other attack), it should not be able to affect Agent B or access Agent B’s data. Lateral movement prevention is a fundamental requirement for multi-agent systems.

Control How It Works Effect
Network Segmentation Each agent runs in its own network segment; inter-agent traffic routes through a controlled gateway Compromised agent cannot directly reach other agents or their data stores
Token Scoping Each agent’s authentication token grants access only to its own tools and data — not shared resources Compromised credentials cannot be used to access other agents’ resources
Message Validation All inter-agent messages are validated against schema and signed with the sender’s identity Impersonation is detected before the message is processed
Delegation Audit Every delegation is logged in full context (who delegated, what task, what permissions were granted) Post-incident forensics can trace the exact delegation chain that led to compromise
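Message validation can be illustrated with a shared-secret HMAC signature (a simplified stand-in: per the table, production systems would typically tie signatures to each agent's mTLS identity or use asymmetric keys):

```python
import hashlib
import hmac
import json

def sign_message(sender_key: bytes, payload: dict) -> dict:
    """Attach a signature so the receiver can verify the sender's identity."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {"payload": payload,
            "signature": hmac.new(sender_key, body, hashlib.sha256).hexdigest()}

def validate_message(sender_key: bytes, message: dict) -> bool:
    """Reject impersonated or tampered messages before they are processed."""
    body = json.dumps(message["payload"], sort_keys=True).encode()
    expected = hmac.new(sender_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["signature"])
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels when comparing signatures.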

8.4 Data Classification

Data classification determines what controls apply to each piece of data the agent handles. Classification should be applied before data enters the agent system — not after.

  • Public: No restrictions beyond rate limiting. Freely accessible, non-sensitive information.
  • Internal: Requires authentication. Employee-facing data. Agents may access but should not exfiltrate to external systems.
  • Confidential: Restricted access with pre-query filtering. Only agents with explicit permission may retrieve. All access is audit-logged.
  • Restricted: Requires human approval for agent access. Output filtering applied before returning to agent. Cryptographic audit logs with 7-year retention. Right-to-be-forgotten requests should be processable.

Note: GDPR creates a compliance tension here: immutable audit logs (required for compliance) conflict with the right to be forgotten (required for privacy). Resolution: audit logs record that access occurred and who performed it, but not the data content itself. Data content is stored separately and can be deleted independently.
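A sketch of the classification-to-controls mapping (illustrative control flags only; a real policy engine would carry far more detail, such as retention periods and output-filtering rules):

```python
# Illustrative mapping of classification tier to agent-side controls.
CONTROLS = {
    "public":       {"auth": False, "pre_query_filter": False, "human_approval": False},
    "internal":     {"auth": True,  "pre_query_filter": False, "human_approval": False},
    "confidential": {"auth": True,  "pre_query_filter": True,  "human_approval": False},
    "restricted":   {"auth": True,  "pre_query_filter": True,  "human_approval": True},
}

def controls_for(classification):
    """Resolve controls before data enters the agent system.

    Unknown or missing labels fail closed to the strictest tier.
    """
    return CONTROLS.get(classification, CONTROLS["restricted"])
```

Failing closed on unlabeled data enforces the section's rule that classification happens before ingestion, not after.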

8.5 Guardrails: Input, Processing, Output

Guardrails are the real-time controls that prevent agents from producing harmful, incorrect, or non-compliant outputs. They operate at three points in the pipeline:

Stage What It Catches Controls
Input Prompt injection, PII in user input, off-topic requests Injection classifier, PII scanner, intent validator — all run before the LLM sees the input
Processing Context window overflow, tool abuse, reasoning loops Context budget enforcement, tool access RBAC, max-iteration guards on reasoning loops
Output Hallucinated content, PII in response, unsafe content Hallucination detector (Example: confidence <0.6 → flag), PII scanner on output, content safety classifier
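The input stage can be illustrated with pattern-based checks (a deliberately toy sketch: production guardrails use trained classifiers and dedicated PII scanners, not regexes):

```python
import re

# Illustrative patterns only; real systems use trained detectors.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal your system prompt"]
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # toy US SSN pattern

def input_guardrail(text: str):
    """Run injection and PII checks before the LLM ever sees the input."""
    findings = []
    if any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        findings.append("prompt_injection")
    if any(re.search(p, text) for p in PII_PATTERNS):
        findings.append("pii")
    # Injection blocks outright; PII is redacted; clean input passes through.
    if "prompt_injection" in findings:
        return ("block", findings)
    if "pii" in findings:
        return ("redact", findings)
    return ("allow", findings)
```

The same structure repeats at the processing stage (iteration guards, tool RBAC) and the output stage (hallucination and content-safety checks).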

8.6 Hallucination Prevention & Detection

The techniques below represent commonly used approaches for reducing hallucinations and improving response reliability.

Control Type Target (Indicative) Method
Prevention Near-zero tolerance for action-triggering outputs RAG grounding (all factual claims should be traceable to source), structured output enforcement, citation requirements
Prevention <2% for informational text Few-shot examples of grounded responses, explicit ‘if unsure, say so’ instructions
Detection >80% agreement among judges 3-judge ensemble: run output through 3 independent evaluators; flag if <2 agree it is correct
Detection Deterministic checks For structured outputs: validate against schema, check consistency with retrieved context, flag numerical claims for verification
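The 3-judge ensemble reduces to a vote with a flag-on-disagreement rule. A minimal sketch (the judge callables stand in for independent LLM evaluators):

```python
def ensemble_flag(output, judges, min_agree=2):
    """Flag an output when fewer than min_agree judges call it correct.

    judges -- callables(output) -> bool, each an independent evaluator
    Returns True when the output should be flagged for review.
    """
    votes = sum(1 for judge in judges if judge(output))
    return votes < min_agree
```

With three judges and `min_agree=2`, an output is flagged exactly when fewer than two judges agree it is correct, matching the detection row above.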

8.7 Monitoring & Observability Architecture

Effective monitoring requires visibility across infrastructure, model behavior, workflow execution, and business outcomes.

Layer Metrics Refresh Rate (Indicative) Audience
Infrastructure CPU/GPU utilization, memory, network I/O, vector DB latency Every 5 minutes Platform engineers
Component Model latency (p50/p95/p99), token cost, tool success rate, cache hit rate, retrieval precision/recall Every 5 minutes ML engineers
Workflow Task completion rate, handoff latency, coordination overhead, error rate by workflow type Every 5 minutes Application owners
Business User satisfaction (implicit signals), cost per resolution, compliance pass rate, escalation rate Daily rollup Product & leadership

8.8 AI Incident Response Playbook

Incident response processes detect, contain, and recover from failures in agent workflows, keeping issues controlled and observable.

AI-specific failures often require specialized response procedures. The following is an illustrative playbook covering common production incidents.

Incident Type 1: Hallucination Detected in Production

  • Immediate action: The affected workflow is typically flagged. If the hallucination has triggered an external action (for example, write, send, or charge), compensation actions such as rollback, cancellation, or refund may be initiated.
  • Short-term (within ~1 hour): Output guardrail sensitivity may be increased, and the hallucinated pattern can be added to adversarial or evaluation test suites.
  • Long-term (within ~1 week): Root cause is analyzed, such as retrieval errors (incorrect RAG context) or model capability gaps, and the affected component is updated accordingly.

Incident Type 2: Prompt Injection Detected

  • Immediate action: The source of the injection (user account, API key, or input channel) is typically isolated or blocked, and the attack payload is logged for analysis.
  • Short-term (within ~1 hour): Actions taken by the agent following the injection are reviewed, and any unauthorized actions may be reversed.
  • Long-term (within ~1 week): Detection mechanisms are updated to incorporate the new attack pattern, and the scenario is added to red team or adversarial testing suites.

Incident Type 3: Model Drift Causing Incorrect Outputs

  • Immediate action: When drift is confirmed (for example, sustained metric degradation), systems may revert to a previously stable model version.
  • Short-term (within ~1 hour): Regression testing is performed on the previous version to confirm stability, and recent changes in model behavior or context are analyzed.
  • Long-term (within ~1 week): Drift detection sensitivity may be adjusted, and additional monitoring or canary strategies can be introduced for the affected model tier.

Production Readiness Checklist

Use this checklist across both documents before declaring production ready. Each item maps to a specific design area and section.

Area Checklist Item Verified
0 Agentic AI suitability assessed using the defined evaluation criteria
1 Each agent has a single measurable goal
1 Skills have defined contracts (input/output/error/SLA/fallback)
1 Skill routing method selected and benchmarked
1 Context budget defined and enforced
1 All skill outputs are schema-validated
2 RAG-vs-no-RAG decision made
2 Embedding model benchmarked on domain queries
2 Chunking strategy matched to content type
2 Index maintenance strategy defined (rebuild/upsert schedule)
2 Memory types separated (working ≠ episodic)
3 Architecture pattern chosen (monolith/modular/microservices)
3 Partial failure handling defined (saga/checkpoint/compensation)
3 Reasoning pattern selected with defined iteration limits
3 Framework vs API decision made and benchmarked
4 Model cascading implemented with confidence measurement
4 Tool registry operational with lifecycle management
4 LLM Gateway deployed with cost attribution tagging
4 Retry and circuit-breaker mechanisms configured for tool calls
5 Communication and coordination patterns selected
5 Deadlock prevention controls in place (depth limit, timeout, graph check)
5 Confidence-based escalation thresholds calibrated
5 Consensus mechanism defined for multi-agent decisions
6 Evaluation dataset bootstrapped with ground truth
6 Multi-step workflow evaluation includes intermediate step checks
6 Regression detection configured with statistical significance tests
6 Adversarial testing suite operational (injection, jailbreak, tool abuse)
6 At least one feedback loop active (HITL or implicit)
7 Model deployment location decided based on volume
7 GPU memory management strategy defined for multi-model setup
7 Cold-start handling implemented (always-warm or pre-warm)
7 4-gate CI/CD pipeline operational
7 Automated rollback triggers configured
7 SLOs defined and monitored
8 MAESTRO 7-layer threat model addressed
8 Zero trust implemented (per-agent identity, mTLS)
8 3-tier autonomy assigned per tool/action
8 Cross-agent lateral movement prevention in place
8 Data classification applied before data enters agent system
8 Guardrails active at input, processing, and output stages
8 4-layer monitoring dashboards operational
8 AI incident response playbook documented and rehearsed

The difference between a pilot and a production system is systematic decision-making across these eight interconnected design areas.

Key Takeaways

  • Multi-agent systems require structured communication protocols and explicit coordination patterns.
  • Evaluation should assess workflow behavior, retrieval quality, and tool execution, not only model output.
  • Infrastructure architecture should support model routing, GPU resource management, and autoscaling.
  • Security controls should address tool permissions, adversarial prompts, and policy enforcement.
  • Observability and human oversight are essential for operational reliability and governance.

Conclusion

Agentic AI marks a transition from isolated model inference to systems where models reason, collaborate, and act within broader software architectures. In this environment, the reliability of the system depends not only on model capability but on how reasoning, coordination, execution, and governance are structured.

Part 1 of the Enterprise Agentic AI guidance addressed the foundational architecture for agents: defining goals, managing context and memory, designing retrieval pipelines, and integrating tools and models. Part 2 focused on the operational architecture required to run these systems in production — including communication protocols, evaluation frameworks, infrastructure patterns, and security controls.

Together, these two articles provide practical architectural guidance for building and operating Enterprise Agentic AI systems.

As organizations begin to deploy agent-driven workflows, success will depend on applying the same engineering discipline used in distributed systems design. Clear architectural boundaries, measurable evaluation practices, and strong governance mechanisms will determine whether agentic AI becomes a reliable enterprise capability or remains an experimental technology.

Lavanya Subbarayalu is an enterprise AI architect specializing in large-scale AI platforms, agentic systems, and responsible AI design. She focuses on building practical, production-ready architectures that enable scalable, reliable, and governed deployment of AI systems across enterprise environments.