… Protocols, Evaluation, Infrastructure & Security
By Lavanya Subbarayalu, Principal Solution Architect, PwC
Designing Enterprise Agentic AI Systems
Part 2 covers the operational and governance layers of agentic AI systems: how agents communicate and coordinate (Area 5), how system behavior is evaluated (Area 6), how agent systems are deployed and scaled (Area 7), and how they are secured and governed (Area 8). These areas transform a working prototype into a reliable and governable production system.
Each area builds on decisions made in Part 1. For example, the number of skills and selection of microservices (Areas 1 and 3) directly determine the multi-agent coordination patterns available here.
These operational areas extend the architectural principles discussed in Part 1.
Prerequisite: Please read Part 1 (Areas 1–4) first, as architectural decisions there constrain the operational options described here.
5. Protocols, Planning & Multi-Agent Coordination
Enterprise agent systems operate as distributed workflows where agents exchange messages, collaborate on reasoning tasks, and execute actions across services. Designing these interactions requires clear communication protocols, coordination patterns, and execution strategies.
This section defines how agents communicate, plan, and coordinate work, and how common failure modes such as deadlocks can be avoided.
5.1 Communication Protocol Selection
Communication protocols define how agents exchange information, invoke tools, and interact with other systems.
| Protocol | Purpose | Typical Use | Selection Criteria |
| MCP (Model Context Protocol) | Standardized tool access | Agent → tool interaction | Structured tool contracts required |
| A2A | Agent‑to‑agent communication | Multi‑agent coordination | Frequent inter‑agent interaction |
| AP2 | Agent payment protocol | Transactional workflows | Regulated financial interactions |
| Agent Network Protocol | Agent ecosystem discovery | Cross‑organization coordination | Dynamic ecosystem integration |
| agents.json | Capability advertisement | Agent discovery | Dynamic capability routing |
| REST or gRPC APIs | Service interface contracts | Enterprise Domain Services | Integration with existing APIs |
Decision Guidance:
Protocols are typically selected based on interaction layers. For example, tool invocation may use MCP, while agent coordination may use A2A messaging.
Multi-Protocol Strategy:
Enterprise agent systems rarely rely on a single protocol. Different interaction layers typically use different protocols: tool interaction may use MCP, agent-to-agent coordination may use A2A, and discovery may use registry-based approaches or agents.json. Multi-protocol architecture should define protocol boundaries explicitly to avoid overlap, inconsistent contracts, and governance gaps.
5.2 Interaction Patterns
Interaction patterns determine how agents exchange information and coordinate tasks across system boundaries.
The choice between synchronous and asynchronous communication is one of the important architectural decisions for multi-agent systems. It affects latency, fault tolerance, and system complexity.
Interaction Pattern Selection Criteria:
The following latency ranges are example starting points and should be calibrated based on domain requirements, workload characteristics, and user experience expectations.
| Pattern | Mechanism | Latency | When to Use | Failure Behavior | Selection Criteria | |
| Synchronous request-response | Direct API call; caller blocks until response | Predictable (call time) | Tasks <5s, immediate response needed. Short deterministic tasks | Caller knows immediately if downstream failed | Immediate response required | |
| Asynchronous messaging | Message queue (Kafka, SQS); caller continues | Variable (queue depth dependent) | Long tasks (>30s), workflow pipelines | Caller decoupled; failures detected via dead letter queue | Unpredictable task duration | |
| Streaming | Continuous connection (WebSocket/SSE) | Real-time (<100ms chunks) | Token-by-token generation, live dashboards | Connection drop requires client reconnect logic | Real-time interaction | |
| Delegation with callback | Task assignment with asynchronous callback when the delegated task completes | Sum of subtask times | Clear ownership handoff between specialized agents | Delegators should track and timeout pending delegations | Expertise required | |
Decision Guidance:
Asynchronous messaging is typically used for workflows that exceed interactive latency thresholds or have unpredictable execution times.
⚠ Anti-pattern
Using synchronous calls for workflows with unpredictable or long execution times.
Synchronous calls block the caller and hold connection and memory resources for the duration of the request. Under high concurrency, this can lead to connection pool exhaustion, thread starvation, and cascading failures.
5.3 Message Schema & Event Design
In event-driven agent architectures, the message schema is the contract between agents. Poorly designed message schemas are a frequent cause of difficult-to-diagnose failures in distributed agent workflows.
Required Fields in Every Agent Message:
| Field | Type | Purpose |
| correlation_id | UUID | Traces a request across the entire agent chain. Every downstream agent inherits this ID. |
| agent_id | String | Identifies which agent produced this message. Critical for debugging and audit. |
| Timestamp | ISO 8601 | When the message was produced. Essential for ordering and drift detection. |
| schema_version | Semver | Which message schema version this conforms to. Enables backward-compatible evolution. |
| idempotency_key | UUID | Prevents duplicate processing. If a message is retried, the receiver can detect and skip duplicates. |
| TTL | Seconds | Time-to-live. Messages older than TTL are moved to dead letter queue, not processed. |
| trace_id | UUID | Enables distributed tracing across services |
Note: Agent messages should be immutable once emitted. Downstream services should append metadata rather than modifying existing fields. Use schema_version to let receivers handle multiple versions gracefully. This prevents subtle debugging issues in distributed agent workflows.
Semantic Grounding & Ontology
Multi-agent systems require shared semantics. Terms such as priority, urgency, approval state, and task type should be defined consistently across agents through shared schemas, vocabularies, and ontology registries. Without this grounding, agents may interpret identical messages differently.
5.4 Agent Discovery
In static systems, agents are hard coded to call specific endpoints. In dynamic systems (microservices, auto-scaling), agents should discover each other at runtime.
The following ranges illustrate typical patterns rather than strict thresholds.
| Scale | Discovery Method | How It Works | Selection criteria |
| <10 agents | Static configuration | Agent endpoints in config file or environment variables. Simple, no runtime dependency. | Small deployments |
| 10–50 agents | Service registry | Agents register on startup; callers query registry by name. Standard service mesh pattern (Consul, etc.). | Moderate scale |
| >50 agents | Capability-based routing | Agents advertise capabilities (not names). Caller describes what it needs; router matches the best available agent. Enables dynamic composition. | Dynamic environments |
| Dynamic ecosystems | Role-based routing | Agents selected using skill matching, workload balancing, or specialization | Heterogeneous agent pools |
Static discovery approaches become difficult to maintain as agent counts increase beyond small-scale deployments (for example, ~10–20 agents).
⚠ Anti-pattern
Hard-coding agent endpoints in large systems. This prevents horizontal scaling and makes deployment brittle.
Dynamic Role Allocation
In large multi-agent systems, agent discovery may be selected dynamically based on skill matching, workload balancing, specialization or bidding mechanisms depending on system complexity and resource constraints.
5.5 Confidence Scoring & Escalation
Not every agent decision deserves the same level of trust. A confidence-based escalation framework lets low-risk decisions flow automatically while routing uncertain decisions to human review. This mechanism helps maintain human oversight in production systems.
The thresholds below are illustrative starting points and should be calibrated based on domain risk tolerance, evaluation results, and operational requirements.
| Confidence Range | Action | Logging | Review |
| >0.9 | Execute automatically | Standard log | None required |
| 0.7–0.9 | Execute, flag for async review | Detailed log with reasoning | Reviewed within 24 hours |
| 0.5–0.7 | Escalate to supervisor agent | Full trace logged | Supervisor may re-route or escalate to human |
| <0.5 | Escalate to human operator | Full trace + alert triggered | Human reviews before actions are executed |
Calibration:
Confidence thresholds should be calibrated using historical evaluation data and domain risk tolerance. For example : a high confidence score in a low-stakes recommendation system is acceptable to act on, whereas the same score in a financial transaction system should trigger human review. Thresholds should be calibrated using the gold-standard evaluation dataset.
Confidence thresholds may vary by phase; planning, execution, and high-risk actions often require different confidence thresholds.
⚠ Anti-pattern
Treating confidence scores as absolute truth. Confidence values are model-dependent and should be calibrated against evaluation datasets.
5.6 Multi-Agent Coordination Patterns
When multiple agents collaborate on a shared goal, the coordination pattern determines how work is divided, how conflicts are resolved, and how the system fails.
| Pattern | Structure | Best For | Failure Risk | ||||
| Supervisor | One orchestrator assigns tasks to worker agents | <20 agents, predictable workflows | Supervisor is single point of failure — should have fallback | ||||
| Peer-to-Peer | Agents communicate directly, no central coordinator | Resilience-critical systems | Complex debugging; no single source of truth | ||||
| Hierarchical | Multi-level supervisors: top → mid → workers | >50 agents, large enterprise systems | Deep hierarchy adds latency at each level | ||||
| Pipeline | Linear chain: Agent A → Agent B → Agent C | Sequential workflows with clear handoffs | Single failure breaks the chain — needs checkpoint/resume | ||||
| Parallel Fan-out | Orchestrator splits work, agents execute in parallel, results merged | Independent subtasks (e.g., multi-source research) | Merge logic should handle partial failures and partial results | ||||
| Delegation | Manager assigns tasks to specialist agents | Clear accountability and specialist routing
|
Manager bottleneck | ||||
| Broadcast | One event is published to multiple subscribers
|
One-to-many notification
|
Ordering and response coordination | ||||
| Sync Mesh
Direct
|
1:1 handoff with full context transfer
|
Precise specialist handoff
|
Blocking and single-failure risk | ||||
| Swarm
|
Agents dynamically self-organize by capability | Large dynamic systems without fixed supervisor
|
Discovery overhead | ||||
| Group Chat
Agents
|
Collaborate in a shared conversational channel
|
Transparent collaborative reasoning
|
Token-heavy and slower | ||||
| Negotiate
Agents
|
Iteratively propose and counter-propose
|
Resource contention and competing goals
|
Deadlock or long convergence | ||||
|
Humans participate in coordination path | Safety-critical or regulated workflows |
|
The simplest coordination pattern is typically sufficient for most workflows. Advanced coordination patterns such as swarm, negotiation, or group chat are typically introduced when scale or coordination complexity exceeds what simpler orchestration models can handle. Complex coordination structures increase failure modes and debugging complexity.
5.7 Planning & Reasoning Coordination
Planning and reasoning coordination defines how tasks are structured and how multiple agents contribute to decision-making during runtime execution.
Planning coordination patterns selection criteria:
| Pattern | Best For | Tradeoff | Selection Criteria | |||
| Distributed Planning |
|
Hidden dependencies may be missed | Independent tasks with minimal dependencies | |||
| Hierarchical Delegation Planning | Complex workflows with clear hierarchy | Coordinator bottleneck
|
Centralized visibility and task ownership required | |||
| Parallel Consensus Planning | Multiple candidate plans need comparison | Higher latency and token cost | Decision quality is prioritized over cost
|
|||
| Collaborative Planning | Shared constraints and competing priorities | Slower convergence | Balanced outcomes required across stakeholders
|
Decision Guidance:
Hierarchical delegation planning is commonly used for enterprise workflows due to its structured control and visibility. Distributed Planning can be used when task boundaries are proven to be independent.
Planning pattern selection should also consider coordination overhead. Distributed approaches minimize coordination cost but may miss dependencies. Hierarchical approaches provide control but introduce orchestration overhead. Consensus and collaborative approaches increase decision quality but incur higher latency and token cost.
Reasoning coordination patterns selection criteria:
| Pattern | Best For | Tradeoff | Selection Criteria | |||
| Sequential Handoff Reasoning |
|
Slower end-to-end execution | Clear reasoning chain across specialist roles | |||
| Parallel Consensus Reasoning | High-stakes decisions requiring independent validation | 2–3x reasoning cost | Correctness is more important than cost | |||
| Debate Reasoning | Ambiguous or conflicting interpretations | Slowest and most token-intensive | No clear protocol match or single authority |
Decision Guidance:
Sequential handoff reasoning is typically used as a default coordination approach for production workflows. Parallel consensus or debate reasoning is more commonly applied in high-risk or ambiguous scenarios where additional validation is required.
5.8 Decisioning & Execution Coordination Patterns
| Pattern | Dependency Structure | Best For | Tradeoff | Selection Criteria | |||
| Sequential Execution | Linear workflow |
|
Predictable, slower | Strict ordering is required | |||
| Parallel Execution | Independent tasks | Faster, more synchronization complexity | Higher coordination overhead | Tasks are independent | |||
| Conditional Execution | Workflows with state-based branching | Flexible decision paths | High validation effort
|
Runtime state determines path | |||
| Iterative Execution | Refinement and repeated verification | Higher Quality outputs | Risk of reasoning loops | Quality improvement scenarios where refinement is required |
Plan-Verify Pattern:
In the plan-verify pattern, an agent proposes a plan, and a verifier validates it against constraints such as completeness, dependency order, resource availability, and safety. High-risk workflows should validate plans before execution rather than after failure.
| Verification Approach | When to use | Tradeoff | Selection Criteria |
| Rule-Based Verification
|
Deterministic policy checks | Fastest, but limited to known rules
|
Known constraints |
| Simulation Verification | Feasibility checks with state transitions | More accurate, slower | Complex workflows |
| LLM-as-Judge Verification | Context-dependent and ambiguous evaluations / reasoning | Flexible, but costs tokens | Contextual validation |
| Hybrid Verification | High-stakes workflows | Strongest coverage, highest cost | Maximum Reliability |
Decision Guidance:
Rule-based verification is typically applied for deterministic constraints, simulation for workflow feasibility, and LLM-as-judge for contextual validation. Hybrid verification is commonly used in high-risk workflows where multiple validation layers are required.
5.9 Deadlock Detection & Prevention
Deadlocks are a common failure mode in distributed systems and can emerge in multi-agent coordination workflows. Agent A waits for Agent B, which waits for Agent C, which waits for Agent A. In practice, deadlocks in agent systems usually arise from circular dependencies in task delegation.
Prevention Strategies
- Enforce delegation depth limits: Delegation depth should be limited (typically 2–4 levels depending on workflow complexity). Beyond the limit, delegation should be escalated to humans. This establishes an upper limit for the cycle length.
- Timeout every delegation: every delegated task has a hard timeout (default 60 seconds). When timeout fires, the task is cancelled and the delegator is notified. No task waits forever.
- Track the delegation graph in real time: maintain a directed graph of active delegations. Before delegating, check if the target agent (or any of its dependents) already depends on the delegating agent. If yes, reject the delegation.
- Dead letter queue for timed-out tasks: timed-out tasks go to a dead letter queue for human review. This is the escape hatch — no deadlock persists beyond the timeout window.
⚠ Anti-pattern
Assuming deadlocks won’t happen because the agents are ‘well-designed.’ Deadlocks are emergent properties of concurrent systems. Design for detection, not just prevention.
5.10 Context Handoff Management
When workflows move between agents, handoffs should preserve reasoning history, findings, metadata, and provenance. Context loss during handoff is a common coordination failure mode, especially in multi-stage workflows.
5.11 Consensus Mechanisms
When multiple agents should agree on a decision (e.g., risk assessment, content classification), a consensus mechanism defines how agreement is reached.
| Mechanism | Agreement Threshold | Speed | When to Use |
| Majority Vote | >50% agree | Fast (~500ms) | Low-stakes fast decisions, informational queries |
| Weighted Consensus | Weighted by expertise/confidence score | Fast (~500ms) | When agents have different specialization levels. Expert decisions, quality focus |
| Quorum (K of N) | K out of N agents agree (for example 2 of 3 agents) | Medium (~1s) | Production standard for most decisions |
| Unanimous | 100% agreement required | Slow (bounded by slowest agent) | Safety-critical decisions only — e.g., medical, financial |
| Debate | Moderator selects strongest argument | Slow | Complex diagnosis or ambiguous reasoning tasks |
| Byzantine Fault Tolerance (BFT) | Tolerates malicious or faulty nodes | Slow | Multi-vendor or adversarial environments |
BFT is typically used in multi-vendor or untrusted environments where agents may behave inconsistently or maliciously, and fault tolerance is required.
Conflict Resolution (when consensus fails):
- Confidence-based: if one agent’s confidence exceeds another’s by more than 0.15, it wins automatically. This avoids unnecessary escalation when one agent is clearly more certain.
- Priority-based: pre-defined priority tiers for specific domains (e.g., compliance agent overrides general agents on regulatory questions).
- Human escalation: after 3 automated rounds without consensus, escalate to human. Do not loop indefinitely.
5.12 Advanced Coordination Mechanisms
Large-scale systems may incorporate additional coordination capabilities.
Adaptive Decisioning
Systems may dynamically adjust coordination topology, protocol selection, or routing thresholds based on runtime performance signals.
Reputation & Trust Scoring
Agent success rate, latency, and calibration accuracy can influence routing decisions in large multi-agent ecosystems.
Cost-Aware Orchestration
Routing policies may consider model cost, token usage, and latency budgets to optimize resource utilization.
6. Evaluation & Feedback Loops
Reliable agentic systems require robust evaluation. This section describes how agent workflows are validated before deployment, monitored for degradation in production, and improved through continuous feedback.
The central challenge differs from traditional model evaluation. Instead of assessing a single prediction, agentic systems require evaluation of an entire workflow — a sequence of decisions where errors can accumulate across steps.
6.1 Building the Evaluation Dataset
Evaluation pipelines require ground truth data to validate system behavior. For agent systems, the initial challenge is dataset bootstrapping, where production data does not yet exist, yet evaluation requires representative tasks.
A common approach is to bootstrap an initial evaluation dataset before deployment and refine it as production traffic becomes available.
| Phase | Dataset Size (illustrative) | Source | Method |
| Bootstrap | 50–100 queries | Manual creation by domain experts | Write queries that cover happy path, edge cases, failure modes, adversarial inputs |
| Expansion | 200–500 queries | Real user queries (anonymized) + synthetic | Capture first 200 real queries: label with domain expert. Add synthetic edge cases. |
| Scale | 1000+ queries | Production replay (anonymized) + synthetic generation | 70% real (anonymized), 20% synthetic edge cases, 10% gold-standard expert-labeled |
Common evaluation dataset distribution:
In steady state, the evaluation dataset is approximately 70% anonymized production queries (daily refresh), 20% synthetic edge cases (quarterly refresh), and 10% gold-standard expert-labeled queries (monthly review).
The gold-standard set is the anchor — it is the set that detects regressions. Evaluation datasets should evolve alongside production traffic to detect real-world drift.
6.2 Evaluating Multi-Step Workflows
Evaluating single-turn responses is relatively simple, as it involves determining whether the model provided the correct answer. However, assessing multi-step workflows is more challenging. An error occurring in an early step, such as step 2, may not become apparent until a later stage, such as step 5. Therefore, it is essential to assess both the overall outcome as well as the accuracy of each intermediate step.
| Evaluation Level | What To Measure | Method | Example |
| Component | Each skill in isolation | Unit test with mocked inputs | verify_identity returns correct result for known test cases |
| Step Accuracy | Each step in the workflow | Trace-based: compare each step’s output to expected | Step 2 (route to specialist) correctly routes 95% of cases |
| End-to-End | Final outcome vs ground truth | Full workflow execution with real or simulated inputs | Customer issue resolved in ≤3 turns |
| Path Validity | Did the agent take a VALID path? | Compare taken path to set of acceptable paths | Any path that reaches resolution is valid, even if non-standard |
Note:
Path validity is one of the most valuable evaluation signals in agent systems. An agent may reach a correct outcome through a different but valid reasoning path. Evaluation should therefore define acceptable paths rather than enforcing a single expected path.
⚠ Anti-pattern
Evaluating only final outputs while ignoring intermediate reasoning steps.
6.3 Regression Detection Between Versions
Ensuring that deploying a new model version or updating a skill does not negatively impact performance requires careful evaluation and monitoring.
Statistical significance testing is necessary, since a 1% decrease in accuracy on a test set of 100 queries could simply be random variation, while the same drop on a test set of 10,000 queries indicates a meaningful change.
| Metric | Regression Signal | Statistical Test | Action |
| Task Completion Rate | Drop >2% absolute | Chi-squared test (p<0.05) | Block deployment; investigate |
| Response Latency (p95) | Increase >20% | Mann-Whitney U test | Flag; proceed if latency is still within SLO |
| Hallucination Rate | Increase >0.5% absolute | Fisher’s exact test (small counts) | Block deployment immediately |
| Cost per Query | Increase >30% | Simple comparison vs baseline | Flag; may indicate model tier drift in cascading |
Shadow Evaluation:
Shadow evaluation can be used when introducing a new version. In this approach, the new system receives production traffic, but its responses are not exposed to users.
Shadow outputs are compared with the live system on the same inputs to assess performance across key metrics. Deployment decisions of new versions are typically based on whether the shadow system demonstrates comparable or improved performance.
6.4 Adversarial Testing
Adversarial testing probes the system for vulnerabilities that normal testing misses. For agent systems, the attack surface is larger than for standard applications — prompts are user-controlled input, and agents can take actions (not just return text).
| Attack Type | Description | Vectors | Frequency | Target |
| Prompt Injection | Attacker embeds instructions in input to override agent behavior | Direct injection in query, indirect via retrieved documents, multi-turn escalation | Weekly | 0% success rate |
| Jailbreak | Attacker bypasses safety guardrails via roleplay, hypotheticals, or encoding | DAN prompts, fictional scenarios, base64-encoded instructions | Monthly | 0% success rate |
| Tool Abuse | Attackers manipulate agents into calling tools with malicious parameters | Crafted inputs that cause SQL injection, file access, or privilege escalation via tool parameters | Monthly | 0% success rate |
| Red Team | Structured adversarial testing by dedicated team | 1000+ automated scenarios + human creative attacks | Quarterly | >95% detection rate |
| Multi-Agent Attack | Compromised agent impersonates another or intercepts messages | Agent impersonation, man-in-the-middle on agent communication, privilege escalation via delegation | Monthly | 100% detection rate |
| Data Exfiltration | Agent tricked into revealing confidential data | Indirect retrieval prompts, chain attacks | Monthly | 0% success |
⚠ Anti-pattern
Only testing with ‘friendly’ inputs. Adversarial testing should be a formal, scheduled process with defined attack vectors. Ad-hoc testing catches obvious issues; systematic adversarial testing catches the subtle ones that cause production incidents.
6.5 Feedback & Continuous Improvement
Evaluation without feedback limits the ability to improve system behavior. Feedback loops connect evaluation results to system updates, enabling alignment between observed performance and intended outcomes.
| Feedback Pattern | How It Works | When to Use | Improvement Cycle |
| Human-in-the-Loop (HITL) Oversight | Human monitors agent actions; intervenes on anomalies | High-stakes domains (financial, medical) | Continuous — interventions inform model updates |
| HITL Approval | Agent proposes action → human reviews → approves/rejects → agent executes | Actions with significant consequences | Approval data becomes training signal |
| Implicit Feedback | Track whether the user’s problem was resolved (follow-up queries, satisfaction signal) | All production agents | Weekly rollup informs prompt and skill tuning |
| Active Learning | Flag low-confidence queries (say <0.7) for human labeling; use labels to fine-tune or update prompts | High-volume domains where labeling is feasible | Batch update monthly; 50–80% reduction in labeling cost over time |
7. Infrastructure, Deployment & Performance
This area covers the operational foundation: where models run, how agent systems are deployed, how performance is maintained at scale, and how costs are managed. The key insight for agent systems: infrastructure decisions are tightly coupled to model decisions (Area 4). Model cascading, tool parallelism, and RAG caching all have infrastructure implications.
7.1 Where Models Run
The decision of where to run models — cloud API, hybrid, or on-premises — is driven by volume, latency requirements, data sovereignty, and cost at scale.
The volume ranges below are illustrative reference points and should be calibrated based on workload characteristics, latency requirements, and cost constraints
| Volume | Burst Pattern | Recommended Deployment | Best For |
| <500K queries/day | Unpredictable (10× spikes) | Managed Cloud APIs | Elastic scaling; no GPU management; pay per token |
| 500K–2M/day | Sustained with <3× peaks | Hybrid: on-prem for baseline + cloud for burst | Predictable baseline on-prem; cloud absorbs burst traffic |
| >2M/day | Predictable, sustained | On-premises (GPU cluster) | Predictable baseline on-prem; cloud absorbs burst traffic |
| Data sovereignty required | Any | On-premises or private cloud | Data should remain within network boundaries due to regulatory constraints |
7.2 Serving Infrastructure
For on-premises or hybrid deployments, the model serving layer is a critical engineering decision. It affects throughput, latency, and GPU utilization.
| Serving Framework | Best For | Key Feature | Consideration |
| vLLM | High throughput serving of open models | PagedAttention — near-optimal GPU memory utilization | Good choice for production open model serving |
| TGI (Text Generation Inference) | Hugging Face model ecosystem | Optimized for transformer architectures | Good defaults; less tuning flexibility than vLLM |
| Ollama | Developer machines, small-scale local inference | Simple setup, model management | Not designed for production-scale throughput |
| Managed Cloud Endpoints (Azure AI, AWS Bedrock, OpenAI, Anthropic, Vertex AI ) | Managed model serving | Fully managed scaling and availability | Limited control over infra and tuning |
7.3 GPU Memory Management for Multi-Model and Cascading Systems
Model cascading means multiple models should be available simultaneously. Loading all models simultaneously into GPU memory is often inefficient for multi-model systems.
The following strategies manage this:
| Strategy | How It Works | When to Use | Trade-off |
| Model Parallelism | Split one large model across multiple GPUs | Single frontier model >80GB | Uses all GPUs efficiently; adds inter-GPU communication latency |
| GPU Time-Sharing (MPS) | Multiple models share GPU via NVIDIA MPS | 2–3 small/mid models on one GPU | Simple; models compete for compute — latency unpredictable under load |
| Dedicated Pools | Small models on GPU Pool A; frontier models on Pool B | Clear tier separation, predictable latency | Requires more GPUs but gives SLO-predictable performance |
| Offload to CPU | Keep model weights in CPU memory; load to GPU on demand | Infrequently used specialist models | Cold-start latency on first call (seconds); subsequent calls are warm |
These strategies are commonly used in multi-model and cascading architectures where multiple model tiers should be available concurrently.
Note:
For cascading architectures, dedicated pools are a common approach. Small/mid-tier models run on a shared pool with auto-scaling; frontier models run on a dedicated, always-warm pool, which provides a predictable latency for each tier.
7.4 Cold Start & Warm-Up
LLM serving may have cold-start latency: the first request after a model is loaded (or a container starts) may take significantly longer than subsequent requests due to model weight loading and JIT compilation. This directly impacts the p95/p99 latency SLOs.
| Approach | How It Works | Latency Impact | Cost |
| Always-warm pool | Keep at least 1 instance of each model tier always running | Eliminates cold start | Baseline cost even at zero traffic |
| Pre-warm on scale-out | When auto-scaler adds a pod, it pre-warms the model before accepting traffic | Hides cold start behind scale-out delay | Slight over-provisioning |
| Health-check warm-up | Container health check includes a model inference call; pod is not marked ready until warm | Traffic never hits a cold instance | Slower pod readiness (~10–30s) |
SLO Planning:
Typically, p95 SLO definition should account for cold-start latency. For example, if the warm inference p95 is 800ms but cold start adds 3s, then p95 SLO should reflect the expected mix of cold and warm calls based on the traffic pattern.
7.5 Deployment Strategies
| Strategy | How It Works | Best For | Rollback |
| Blue-Green | Two identical environments; traffic switches atomically | Mission-critical systems (life/safety) | Instant (<10s) switch back to blue |
| Canary | Route traffic gradually: 10% % → 50% → 100%, with automated checks at each stage | Standard production deployments | Automatic rollback if SLO gates fail at any stage |
| Shadow | New version receives 100% of traffic in parallel but responses are not shown to users | High-stakes validation before any user exposure | No rollback needed — shadow is read-only |
7.6 CI/CD Pipeline — 4 Gates
Every deployment should pass four automated gates. Any gate failure blocks the deployment.
The values below are illustrative reference points and should be calibrated using evaluation data, deployment risk, and acceptable error thresholds.
| Gate | What It Checks | Pass Criteria | Failure Action |
| Gate 1: Evaluation | Model quality on gold-standard dataset | Task completion >95%, hallucination rate <2%, error rate <1% | Block deployment; alert ML team |
| Gate 2: Security | Adversarial and compliance checks | 0 critical findings, 0 PII leaks, 0 prompt injection successes | Block deployment; alert security team |
| Gate 3: Performance | Latency, cost, throughput vs baseline | p95 latency <2× baseline, cost <1.5× baseline, throughput >0.8× baseline | Block if SLO breached; flag if within 20% of threshold |
| Gate 4: Synthetic Load Simulation | Full workflow simulation (realistic tool latency, external responses) under synthetic load | Workflow completion >95%, handoff loss <1%, 0 deadlocks detected | Block deployment; investigating coordination issues |
7.7 Automated Rollback Triggers
Rollback should be automatic for critical metrics. Defining trigger thresholds in advance prevents delayed detection during incidents.
The triggers below are illustrative reference points and should be calibrated based on system behavior, acceptable risk thresholds, and operational requirements.
| Metric | Rollback Trigger
(illustrative) |
Detection Window (Indicative) | Action |
| Error Rate | >3% (5× baseline) | 5 minutes | Immediate rollback to previous version |
| Latency p95 | >5s (sustained across at least 3 monitoring intervals) | 10 minutes | Immediate rollback |
| Hallucination Rate | >5% (2.5× baseline) | 15 minutes | Rollback + alert + incident ticket |
| Task Completion | <85% (10% drop) | 30 minutes | Rollback + investigation |
| Cost per Query | >3× baseline (sustained) | 1 hour | Alert; rollback if not explained by traffic change |
7.8 SLI/SLO Targets
Service Level Indicators (SLIs) are measurable metrics. Service Level Objectives (SLOs) are targets that have been defined.
The values given below are illustrative reference points and should be calibrated based on workload behavior, system performance targets, and operational constraints.
| Metric (SLI) | Example Target (SLO) | Measurement Window | Notes |
| Task Completion Rate | >95% | Rolling 24h | End-to-end: query received → resolution delivered |
| Response Latency (p95) | <2 seconds | Rolling 1h | Excludes streaming scenarios |
| Error Rate | <1% | Rolling 1h | System errors + tool failures + timeouts |
| Escalation Rate | <10% | Rolling 24h | Queries requiring human intervention |
| Model Latency (p95) | SLM <400ms, LLM <1.8s | Rolling 1h | Per-model-tier, excludes network |
| Tool Success Rate | >98% | Rolling 1h | Including retries, excluding circuit-broken tools |
| Cache Hit Rate | >70% | Rolling 1h | RAG + prompt caching combined |
| Availability | >99.9% | Rolling 30 days | System uptime (health check based) |
Note: These are starting points. A customer-facing chatbot may need p95 <1s; an internal research tool may tolerate p95 <5s. Calibrate against the user experience expectations.
7.9 Cost Optimization — Prioritized by ROI
The prioritization and expected savings below are indicative and should be adjusted based on workload characteristics, cost drivers, and operational constraints.
| Optimization | Expected Savings (indicative)
|
When to Implement | Complexity |
| Model Cascading | 60–70% model cost reduction | Often one of the first optimizations implemented. | Medium (requires confidence measurement) |
| Prompt Caching | 40–60% on repeated prompts | When cache hit rate >50% is achievable | Low (provider feature) |
| RAG Cache | 30–50% on retrieval-heavy workflows | When >70% cache hit achievable | Medium (cache invalidation logic) |
| Prompt Optimization | 20–30% token reduction | When prompts exceed 2K tokens | Low (iterative refinement) |
| Tool Parallelization | 2–3× throughput (not cost, but $/throughput) | When 5+ independent tools per workflow | Medium (dependency graph analysis) |
| Spot Instances | 30–60% compute cost on non-critical tasks | For batch eval, non-real-time agent tasks | Medium (spot interruption handling) |
⚠ Anti-pattern
Deploying frontier models for all workloads without routing or cascading, leading to unnecessary costs and latency increases.
8. Security, Governance & Monitoring
Security for agentic AI systems extends beyond traditional application security because agents can execute actions, not just return text. A compromised or manipulated agent can modify databases, send messages, execute code, or call external APIs. This area covers the threat model, the security architecture, and the monitoring required to maintain trust in production.
8.1 Threat Model: MAESTRO 7-Layer Framework
MAESTRO (Model, Agent, Ecosystem, Security, Trust, Risk, Operations) is a layered threat modeling approach designed specifically for agent-based systems. It defines seven layers of threat surface specific to agentic AI systems. Each layer requires distinct controls.
| Layer | Threat Surface | Primary Controls |
| 1. Foundation Models | Model manipulation via adversarial prompts | 3-layer guardrails (input/processing/output), hallucination detection, output validation |
| 2. Data Operations | Data poisoning, PII exposure, unauthorized access | PII detection at ingress/egress, 4-tier data classification, encryption at rest and in transit |
| 3. Agent Frameworks | Tool abuse, parameter manipulation, sandbox escape | Tool-level RBAC, parameter validation against schema, execution sandboxing |
| 4. Infrastructure | Network interception, credential theft | mTLS between all services, per-agent service accounts, secrets vault (HashiCorp Vault), 7-day cert rotation |
| 5. Evaluation | Undetected degradation, drift | Confidence-based escalation, continuous drift detection, adversarial testing |
| 6. Compliance | Audit gaps, unauthorized autonomous action | 3-tier autonomy levels, immutable audit logs, 4-gate approval for changes |
| 7. Ecosystem | Supply chain attacks, compromised dependencies | Dependency scanning, security trimming of unused capabilities, handoff monitoring |
8.2 Zero Trust Architecture for Agents
Traditional perimeter security assumes that entities within the network are trusted. In agent-based systems, where agents interact with external APIs and other agents, this assumption may not hold. Zero-trust principles require that each request is verified, regardless of its origin.
Per-Agent Identity
- Every agent instance has a unique identity (service account) not a shared credential.
- All inter-agent and agent-to-service communication uses mTLS (mutual TLS) both sides prove identity.
- Certificates rotate every 7 days. Short-lived credentials limit the blast radius of a compromised agent.
3-Tier Autonomy Levels
| Tier | Autonomy Level | Examples | Control (illustration) |
| T1: Autonomous | Agent acts without approval | Read operations, low-risk queries, informational responses | Confidence should be >0.8 |
| T2: Supervised | Agent acts, human reviews afterward | Write operations, moderate-impact actions | Confidence should be >0.6; flagged for review |
| T3: Approved | Agent proposes, human approves before execution | Financial transactions, data deletion, external communications | Humans should approve before any action |
Autonomy Assignment:
Autonomy tiers are typically defined at the tool or action level rather than at the agent level. This allows a single agent to operate at different autonomy levels depending on the task (for example, read operations (T1) vs. write operations (T3)), and helps prevent overly broad permissions.
8.3 Cross-Agent Trust & Lateral Movement Prevention
If Agent A is compromised (via prompt injection or other attack), it should not be able to affect Agent B or access Agent B’s data. Lateral movement prevention is a fundamental requirement for multi-agent systems.
| Control | How It Works | Effect |
| Network Segmentation | Each agent runs in its own network segment; inter-agent traffic routes through a controlled gateway | Compromised agent cannot directly reach other agents or their data stores |
| Token Scoping | Each agent’s authentication token grants access only to its own tools and data — not shared resources | Compromised credentials cannot be used to access other agents’ resources |
| Message Validation | All inter-agent messages are validated against schema and signed with the sender’s identity | Impersonation is detected before the message is processed |
| Delegation Audit | Every delegation is logged in full context (who delegated, what task, what permissions were granted) | Post-incident forensics can trace the exact delegation chain that led to compromise |
8.4 Data Classification
Data classification determines what controls apply to each piece of data the agent handles. Classification should be applied before data enters the agent system — not after.
- Public: No restrictions beyond rate limiting. Freely accessible, non-sensitive information.
- Internal: Requires authentication. Employee-facing data. Agents may access but should not exfiltrate to external systems.
- Confidential: Restricted access with pre-query filtering. Only agents with explicit permission may retrieve. All access is audit-logged.
- Restricted: Requires human approval for agent access. Output filtering applied before returning to agent. Cryptographic audit logs with 7-year retention. Right-to-be-forgotten requests should be processable.
Note: GDPR Compliance Challenge: immutable audit logs (for compliance) conflict with right-to-be-forgotten (for privacy). Resolution: audit logs record that access occurred and who performed it — but not the data content itself. Data content is stored separately and can be deleted independently.
8.5 Guardrails: Input, Processing, Output
Guardrails are the real-time controls that prevent agents from producing harmful, incorrect, or non-compliant outputs. They operate at three points in the pipeline:
| Stage | What It Catches | Controls |
| Input | Prompt injection, PII in user input, off-topic requests | Injection classifier, PII scanner, intent validator — all run before the LLM sees the input |
| Processing | Context window overflow, tool abuse, reasoning loops | Context budget enforcement, tool access RBAC, max-iteration guards on reasoning loops |
| Output | Hallucinated content, PII in response, unsafe content | Hallucination detector (Example: confidence <0.6 → flag), PII scanner on output, content safety classifier |
8.6 Hallucination Prevention & Detection
The techniques below represent commonly used approaches for reducing hallucinations and improving response reliability.
| Control Type | Target (Indicative) | Method |
| Prevention | Near-zero tolerance for action-triggering outputs | RAG grounding (all factual claims should be traceable to source), structured output enforcement, citation requirements |
| Prevention | <2% for informational text | Few-shot examples of grounded responses, explicit ‘if unsure, say so’ instructions |
| Detection | >80% agreement among judges | 3-judge ensemble: run output through 3 independent evaluators; flag if <2 agree it is correct |
| Detection | Deterministic checks | For structured outputs: validate against schema, check consistency with retrieved context, flag numerical claims for verification |
8.7 Monitoring & Observability Architecture
Effective monitoring requires visibility across infrastructure, model behavior, workflow execution, and business outcomes.
| Layer | Metrics | Refresh Rate
(Indicative) |
Audience |
| Infrastructure | CPU/GPU utilization, memory, network I/O, vector DB latency | Every 5 minutes | Platform engineers |
| Component | Model latency (p50/p95/p99), token cost, tool success rate, cache hit rate
Retrieval precision / recall |
Every 5 minutes | ML engineers |
| Workflow | Task completion rate, handoff latency, coordination overhead, error rate by workflow type | Every 5 minutes | Application owners |
| Business | User satisfaction (implicit signals), cost per resolution, compliance pass rate, escalation rate | Daily rollup | Product & leadership |
8.8 AI Incident Response Playbook
Incident response processes are designed to detect, contain, and recover from failures in agent workflows. These processes support system reliability and enable issues to be addressed in a controlled and observable manner.
AI-specific failures often require specialized response procedures. The following is an illustrative playbook covering common production incidents.
Incident Type 1: Hallucination Detected in Production
- Immediate action: The affected workflow is typically flagged. If the hallucination has triggered an external action (for example, write, send, or charge), compensation actions such as rollback, cancellation, or refund may be initiated.
- Short-term (within ~1 hour): Output guardrail sensitivity may be increased, and the hallucinated pattern can be added to adversarial or evaluation test suites.
- Long-term (within ~1 week): Root cause is analyzed, such as retrieval errors (incorrect RAG context) or model capability gaps, and the affected component is updated accordingly.
Incident Type 2: Prompt Injection Detected
- Immediate action: The source of the injection (user account, API key, or input channel) is typically isolated or blocked, and the attack payload is logged for analysis.
- Short-term (within ~1 hour): Actions taken by the agent following the injection are reviewed, and any unauthorized actions may be reversed.
- Long-term (within ~1 week): Detection mechanisms are updated to incorporate the new attack pattern, and the scenario is added to red team or adversarial testing suites.
Incident Type 3: Model Drift Causing Incorrect Outputs
- Immediate action: When drift is confirmed (for example, sustained metric degradation), systems may revert to a previously stable model version.
- Short-term (within ~1 hour): Regression testing is performed on the previous version to confirm stability, and recent changes in model behavior or context are analyzed.
- Long-term (within ~1 week): Drift detection sensitivity may be adjusted, and additional monitoring or canary strategies can be introduced for the affected model tier.
Production Readiness Checklist
Use this checklist across both documents before declaring production ready. Each item maps to a specific design area and section.
| Area | Checklist Item | Verified |
| 0 | Agentic AI suitability assessed using the defined evaluation criteria | ☐ |
| 1 | Each agent has single measurable goal | ☐ |
| 1 | Skills have defined contracts (input/output/error/SLA/fallback) | ☐ |
| 1 | Skill routing method selected and benchmarked | ☐ |
| 1 | Context budget defined and enforced | ☐ |
| 1 | All skill outputs are schema-validated | ☐ |
| 2 | RAG-vs-no-RAG decision made | ☐ |
| 2 | Embedding model benchmarked on domain queries | ☐ |
| 2 | Chunking strategy matched to content type | ☐ |
| 2 | Index maintenance strategy defined (rebuild/upsert schedule) | ☐ |
| 2 | Memory types separated (working ≠ episodic) | ☐ |
| 3 | Architecture pattern chosen (monolith/modular/microservices) | ☐ |
| 3 | Partial failure handling defined (saga/checkpoint/compensation) | ☐ |
| 3 | Reasoning pattern selected with defined iteration limits | ☐ |
| 3 | Framework vs API decision made and benchmarked | ☐ |
| 4 | Model cascading implemented with confidence measurement | ☐ |
| 4 | Tool registry operational with lifecycle management | ☐ |
| 4 | LLM Gateway deployed with cost attribution tagging | ☐ |
| 4 | Retry and circuit-breaker mechanisms configured for tool calls | ☐ |
| 5 | Communication, coordination patterns selected | ☐ |
| 5 | Deadlock prevention controls in place (depth limit, timeout, graph check) | ☐ |
| 5 | Confidence-based escalation thresholds calibrated | ☐ |
| 5 | Consensus mechanism defined for multi-agent decisions | ☐ |
| 6 | Evaluation dataset bootstrapped with ground truth | ☐ |
| 6 | Multi-step workflow evaluation includes intermediate step checks | ☐ |
| 6 | Regression detection configured with statistical significance tests | ☐ |
| 6 | Adversarial testing suite operational (injection, jailbreak, tool abuse) | ☐ |
| 6 | At least one feedback loop active (HITL or implicit) | ☐ |
| 7 | Model deployment location decided based on volume | ☐ |
| 7 | GPU memory management strategy defined for multi-model setup | ☐ |
| 7 | Cold-start handling implemented (always-warm or pre-warm) | ☐ |
| 7 | 4-gate CI/CD pipeline operational | ☐ |
| 7 | Automated rollback triggers configured | ☐ |
| 7 | SLOs defined and monitored | ☐ |
| 8 | MAESTRO 7-layer threat model addressed | ☐ |
| 8 | Zero trust implemented (per-agent identity, mTLS) | ☐ |
| 8 | 3-tier autonomy assigned per tool/action | ☐ |
| 8 | Cross-agent lateral movement prevention in place | ☐ |
| 8 | Data classification applied before data enters agent system | ☐ |
| 8 | Guardrails active at input, processing, and output stages | ☐ |
| 8 | 4-layer monitoring dashboards operational | ☐ |
| 8 | AI incident response playbook documented and rehearsed | ☐ |
The difference between a pilot and a production system is systematic decision-making across these eight interconnected design areas.
Key Takeaways
- Multi-agent systems require structured communication protocols and explicit coordination patterns.
- Evaluation should assess workflow behavior, retrieval quality, and tool execution, not only model output.
- Infrastructure architecture should support model routing, GPU resource management, and autoscaling.
- Security controls should address tool permissions, adversarial prompts, and policy enforcement.
- Observability and human oversight are essential for operational reliability and governance.
Conclusion
Agentic AI marks a transition from isolated model inference to systems where models reason, collaborate, and act within broader software architectures. In this environment, the reliability of the system depends not only on model capability but on how reasoning, coordination, execution, and governance are structured.
Part-1 of the Enterprise Agentic AI guidance addressed the foundational architecture for agents: defining goals, managing context and memory, designing retrieval pipelines, and integrating tools and models. Part-2 focused on the operational architecture required to run these systems in production — including communication protocols, evaluation frameworks, infrastructure patterns, and security controls.
Together, these two articles provide practical architectural guidance for building and operating Enterprise Agentic AI systems.
As organizations begin to deploy agent-driven workflows, success will depend on applying the same engineering discipline used in distributed systems design. Clear architectural boundaries, measurable evaluation practices, and strong governance mechanisms will determine whether agentic AI becomes a reliable enterprise capability or remains an experimental technology.
Lavanya Subbarayalu is an enterprise AI architect specializing in large-scale AI platforms, agentic systems, and responsible AI design. She focuses on building practical, production-ready architectures that enable scalable, reliable, and governed deployment of AI systems across enterprise environments.
