Agent Observability: Understanding How AI Agents Work
Comprehensive metrics for AI agent behavior, performance, and cost tracking.
The previous posts covered how AI requests flow through the system—routing, grounding, guardrails, evaluation, explainability, and governance. But there’s a critical question we haven’t addressed: How do we know if the AI agent is actually working well? This post explores agent observability—the metrics and insights that reveal whether your AI workflows are performing as intended.
The Problem: LLM Metrics Don’t Capture Agent Behavior
Here’s a fundamental insight that drives everything in this post: standard LLM metrics are necessary but not sufficient for AI agents. Traditional evaluation metrics like hallucination detection, toxicity scoring, and answer relevancy measure the quality of individual responses—but agents do much more than generate text.
The Gap: An agent might produce high-quality responses (low hallucination, high relevance) but still fail at its actual task because it selected the wrong tools, deviated from the expected workflow, took 10 steps instead of 3, or never actually completed what was asked. Standard LLM evaluation misses all of this.
When we analyzed commercial platforms like Galileo AI against our implementation, we discovered a critical gap: we had strong governance, excellent guardrails, and comprehensive evaluation—but we lacked agent-specific observability. Our platform was 90% complete, missing the 10% that matters most for understanding agent behavior.
What Standard LLM Metrics Miss
- Flow adherence: Did the agent follow the intended workflow?
- Task completion: Did the agent actually finish what was asked?
- Tool selection quality: Did it pick the right tools for the job?
- Efficiency: How many steps did it take vs. optimal?
- Multi-step reasoning: How does it perform across complex workflows?
- Self-correction: Can it recover from errors?
Why Agent Observability Matters
AI agents are complex systems. Unlike traditional software where inputs directly produce outputs, AI agents make decisions, select tools, reason through multi-step problems, and sometimes correct themselves mid-execution. Without observability, these workflows are black boxes.
The Challenge: How do you debug an AI agent that took 12 steps instead of 3? Or identify why costs spiked 5x on certain queries? Or determine if an agent is learning bad habits? You need visibility into the agent’s decision-making process itself.
Seven Critical Dimensions of Agent Observability
Production LLM monitoring requires going far beyond simple metrics. Based on industry research and real-world experience, comprehensive agent observability spans seven dimensions:
- Request Tracing & Lineage: Understand the FULL journey of each request through the system
- Token Usage & Cost Attribution: AI costs can spiral without granular tracking by user, team, model
- Prompt Engineering & Experimentation: Version control and A/B testing for prompts
- Quality & Correctness Metrics: Groundedness, relevance, hallucination detection
- Performance & Latency: End-to-end timing, component latencies, time-to-first-token
- Error Tracking & Debugging: Full context for rapid diagnosis and resolution
- Business Metrics & ROI: Connect AI usage to business outcomes
Agent observability answers questions like:
- Is the agent following expected workflows? (flow adherence metrics)
- Is it being efficient? (steps taken vs. optimal path)
- Is it completing tasks successfully? (task completion evaluation)
- What does it cost to run? (token usage and cost tracking)
- Is it improving over time? (session-level trends)
- Can it recover from errors? (self-correction detection)
Trust Insight: Agent observability builds trust by making AI behavior visible and measurable. When you can see exactly what an agent did, why it did it, and how well it performed, you move from hoping the AI works to knowing it works.
Core Metrics Architecture
The Agent Metrics system tracks multiple dimensions of agent behavior:
1. Flow Adherence: Did the Agent Follow the Plan?
When an AI agent receives a task, there’s typically an expected workflow—a sequence of steps that should be followed. Flow adherence measures how well the actual execution matched expectations.
Trajectory Matching
The system supports multiple matching strategies depending on how strict you need the comparison:
- Exact Match: steps must match perfectly in order and content
- In-Order Match: expected steps must appear in order; extra steps are allowed
- Any-Order Match: all expected actions are present; order doesn’t matter
- Levenshtein Ratio: fuzzy matching based on sequence similarity
Visualizing Flow Adherence
Example: in a document processing workflow where the agent added an extra validation step, the IN_ORDER flow adherence score came out to 0.89.
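To make the matching strategies concrete, here is a minimal sketch of how such an example might be scored. The step names are hypothetical, and the standard library’s difflib.SequenceMatcher stands in for the Levenshtein ratio used in the actual implementation shown later in this post.

# Hypothetical trajectories; difflib stands in for the Levenshtein ratio
from difflib import SequenceMatcher

expected = ["parse_document", "extract_entities", "summarize", "store_result"]
actual = ["parse_document", "extract_entities", "validate_schema",  # extra step
          "summarize", "store_result"]

# Exact match: fails because of the extra step
exact_score = 1.0 if expected == actual else 0.0

# Any-order match: fraction of expected actions that appear at all
any_order_score = len(set(expected) & set(actual)) / len(set(expected))

# In-order (fuzzy) match: similarity of the two step sequences
in_order_score = SequenceMatcher(None, " -> ".join(expected),
                                 " -> ".join(actual)).ratio()

print(exact_score, any_order_score, round(in_order_score, 2))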
2. Agent Efficiency: Steps vs. Optimal Path
Efficiency measures whether the agent is taking the shortest reasonable path to complete a task. An efficient agent uses fewer resources while achieving the same outcome.
// Efficiency calculation
efficiency = optimal_steps / actual_steps
// Interpretation:
// > 1.0 = Agent was more efficient than expected
// = 1.0 = Perfect efficiency
// < 1.0 = Agent took extra steps
Why Efficiency Matters: In AI systems, every step costs tokens, which cost money. An agent that takes 10 steps instead of 3 isn’t just slower; it’s more than three times as expensive. Tracking efficiency helps identify agents that need optimization or prompts that cause excessive reasoning loops.
3. Tool Call Accuracy
Agents often have access to multiple tools (APIs, databases, external services). Tool call accuracy measures two things:
- Selection Accuracy: Did the agent choose the right tools?
- Success Rate: Did those tool calls succeed?
// Combined accuracy calculation
tool_selection_accuracy = correct_tools / expected_tools
success_rate = successful_calls / total_calls
// Final accuracy (weighted average)
accuracy = (success_rate * 0.5) + (tool_selection_accuracy * 0.5)
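A minimal sketch of how these two signals might be combined, assuming tool calls are recorded as simple dictionaries. The field names here are illustrative, not the service’s actual schema.

# Illustrative combination of selection accuracy and success rate
from typing import Dict, List

def tool_call_accuracy(expected_tools: List[str], calls: List[Dict]) -> float:
    """Combine selection accuracy and success rate with equal weight."""
    if not expected_tools or not calls:
        return 0.0
    called_tools = {c["tool"] for c in calls}
    # Selection accuracy: how many expected tools were actually called
    selection = len(called_tools & set(expected_tools)) / len(set(expected_tools))
    # Success rate: fraction of calls that returned without error
    success = sum(1 for c in calls if c.get("success")) / len(calls)
    return 0.5 * success + 0.5 * selection

calls = [
    {"tool": "search_api", "success": True},
    {"tool": "vector_db", "success": True},
    {"tool": "vector_db", "success": False},  # one failed retry
]
print(tool_call_accuracy(["search_api", "vector_db"], calls))  # ≈ 0.83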
4. Task Completion: Did It Actually Work?
The ultimate question: did the agent accomplish what was asked? Task completion evaluation uses multiple signals.
Heuristic Evaluation
# Error patterns (reduce confidence)
error_patterns = ["error", "failed", "exception",
                  "could not", "unable to", "timeout"]

# Success patterns (increase confidence)
success_patterns = ["completed", "successfully", "done",
                    "finished", "achieved"]

def completion_confidence(response: str, user_approved: bool = False) -> float:
    """Weighted heuristic score for task completion (0.0 - 1.0)."""
    text = response.lower()
    has_success = any(p in text for p in success_patterns)
    has_error = any(p in text for p in error_patterns)
    has_substantial_response = len(text) > 100  # simple length threshold
    return (
        has_success * 0.4
        + (not has_error) * 0.3
        + has_substantial_response * 0.2
        + user_approved * 0.1
    )
LLM-as-Judge: AI Evaluating AI
For more nuanced evaluation, the system can use an LLM to judge task completion quality:
LLM-as-Judge Evaluation Criteria
- Task Completion (0-10): Did the agent fully complete the requested task?
- Accuracy (0-10): Is the response accurate and correct?
- Relevance (0-10): Is the response relevant to the task?
- Completeness (0-10): Did the agent address all aspects?
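One way to implement this is to prompt an evaluator model with the criteria and ask for JSON scores. The sketch below is a minimal illustration; the prompt template and the weighting are assumptions, not the eval-service’s actual prompt or aggregation.

# Hypothetical LLM-as-Judge sketch; prompt and weights are illustrative
import json

JUDGE_PROMPT = """You are evaluating an AI agent's work.
Task: {task}
Agent response: {response}

Score each criterion from 0-10 and return JSON:
{{"task_completion": _, "accuracy": _, "relevance": _, "completeness": _}}"""

def judge_completion(task: str, response: str, llm_call) -> dict:
    """llm_call is any function that takes a prompt string and returns text."""
    raw = llm_call(JUDGE_PROMPT.format(task=task, response=response))
    scores = json.loads(raw)
    # Illustrative weighting; a real service may aggregate differently
    scores["confidence"] = (0.4 * scores["task_completion"]
                            + 0.2 * scores["accuracy"]
                            + 0.2 * scores["relevance"]
                            + 0.2 * scores["completeness"]) / 10
    return scores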
5. Token Usage and Cost Tracking
AI costs money. Every prompt token, every completion token has a price. Tracking costs at the request, session, and aggregate level enables budget management and optimization.
Model Pricing Reference
| Model | Prompt (per 1K tokens) | Completion (per 1K tokens) |
|---|---|---|
| GPT-4 | $0.030 | $0.060 |
| GPT-4o | $0.005 | $0.015 |
| Claude 3 Opus | $0.015 | $0.075 |
| Claude 3.5 Sonnet | $0.003 | $0.015 |
| Gemini 1.5 Pro | $0.00125 | $0.005 |
| Local Models (Llama, Mistral) | $0.00 | $0.00 |
Cost Optimization Strategy: The metrics system enables intelligent model selection. Simple queries can route to cheaper models (GPT-4o-mini, local models) while complex tasks use premium models. This can reduce costs by 70-90% without impacting quality on appropriate tasks.
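As an illustration of that routing idea, here is a minimal sketch that picks a model tier from a rough complexity estimate. The tiers, thresholds, and heuristic are assumptions for illustration, not the platform’s actual routing logic.

# Hypothetical complexity-based model routing; thresholds are illustrative
def select_model(query: str, requires_tools: bool = False) -> str:
    complexity = len(query.split()) + (20 if requires_tools else 0)
    if complexity < 30:
        return "llama3"          # local, free
    if complexity < 80:
        return "gpt-4o-mini"     # cheap hosted model
    return "gpt-4o"              # premium model for complex tasks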
6. Session Tracking: Multi-Turn Conversations
Real AI interactions aren’t single requests—they’re conversations. Session tracking aggregates metrics across multiple turns.
Session Metrics Over Multiple Turns
Per-turn metrics accumulate across Turn 1, Turn 2, and Turn 3 into a session total, as sketched below.
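A minimal sketch of that roll-up, assuming each turn is recorded as a small dictionary. The field names are illustrative, not the service’s actual schema.

# Illustrative session roll-up across turns
from typing import Dict, List

def aggregate_session(turns: List[Dict]) -> Dict:
    """Roll per-turn metrics up into session-level totals and averages."""
    return {
        "turns": len(turns),
        "total_tokens": sum(t["tokens"] for t in turns),
        "total_cost_usd": round(sum(t["cost_usd"] for t in turns), 4),
        "total_latency_ms": sum(t["latency_ms"] for t in turns),
        "avg_flow_adherence": sum(t["flow_adherence"] for t in turns) / len(turns),
    }

session = aggregate_session([
    {"tokens": 1200, "cost_usd": 0.012, "latency_ms": 900, "flow_adherence": 0.95},
    {"tokens": 2100, "cost_usd": 0.021, "latency_ms": 1400, "flow_adherence": 0.90},
    {"tokens": 1800, "cost_usd": 0.018, "latency_ms": 1100, "flow_adherence": 0.88},
])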
7. Reasoning Analysis
Modern AI agents use chain-of-thought reasoning to solve complex problems. The metrics system tracks each reasoning step:
// Reasoning step analysis
reasoning_analysis = {
"total_steps": 5,
"total_tokens": 3200,
"total_latency_ms": 4500,
"avg_tokens_per_step": 640,
"avg_latency_per_step": 900,
"actions_taken": 3,
"observations_received": 4,
"action_rate": 0.6 // 60% of steps involved actions
}
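A minimal sketch of how such an analysis might be derived from a list of reasoning steps. The step structure is assumed for illustration, not the service’s actual schema.

# Illustrative aggregation of a chain-of-thought trace
from typing import Dict, List

def analyze_reasoning(steps: List[Dict]) -> Dict:
    """Summarize a reasoning trace into the metrics shown above."""
    actions = [s for s in steps if s.get("type") == "action"]
    observations = [s for s in steps if s.get("type") == "observation"]
    total_tokens = sum(s.get("tokens", 0) for s in steps)
    total_latency = sum(s.get("latency_ms", 0) for s in steps)
    return {
        "total_steps": len(steps),
        "total_tokens": total_tokens,
        "total_latency_ms": total_latency,
        "avg_tokens_per_step": total_tokens / len(steps) if steps else 0,
        "avg_latency_per_step": total_latency / len(steps) if steps else 0,
        "actions_taken": len(actions),
        "observations_received": len(observations),
        "action_rate": len(actions) / len(steps) if steps else 0.0,
    }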
8. Self-Correction Detection
Good agents recognize and recover from errors. The metrics system tracks self-correction events:
// Self-correction analysis
correction_analysis = {
"total_corrections": 2,
"successful_corrections": 2,
"correction_success_rate": 1.0,
"failed_corrections": 0,
"error_types": {
"Invalid API response format": 1,
"Missing required field": 1
}
}
Trust Insight: An agent that can identify errors and correct them builds trust. Self-correction metrics show whether your AI is resilient (catches and fixes issues) or brittle (fails on edge cases). A high correction success rate indicates a robust agent.
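One simple way to surface correction events is to scan the step trace for an error that is followed by a retry of the same action. The sketch below follows that assumption; it is not the service’s actual detection logic, and the step fields are illustrative.

# Hypothetical error -> retry detection over a step trace
from typing import Dict, List

def detect_corrections(steps: List[Dict]) -> Dict:
    """Count error -> retry pairs and whether the retry succeeded."""
    corrections, successful = 0, 0
    error_types: Dict[str, int] = {}
    for prev, curr in zip(steps, steps[1:]):
        if prev.get("error") and curr.get("action") == prev.get("action"):
            corrections += 1
            error_types[prev["error"]] = error_types.get(prev["error"], 0) + 1
            if not curr.get("error"):
                successful += 1
    return {
        "total_corrections": corrections,
        "successful_corrections": successful,
        "failed_corrections": corrections - successful,
        "correction_success_rate": successful / corrections if corrections else 1.0,
        "error_types": error_types,
    }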
Real-World Debugging Scenarios
Agent observability isn’t just about collecting metrics—it’s about solving real problems. Here are common debugging scenarios where these metrics prove their value:
Scenario 1: Agent Fails to Complete Task
Problem: Users report that the AI “doesn’t finish” or gives incomplete answers.
// Metrics reveal the root cause
request_id: "failed_req_123"
workflow_metrics: {
"flow_adherence": 0.4, // Deviated significantly from plan
"agent_efficiency": 0.6, // Took too many steps
"tool_call_accuracy": 0.3 // Wrong tools selected
}
completion_metrics: {
"completed": false,
"confidence": 0.2,
"failure_reason": "Error detected in response"
}
Finding: Flow adherence of 0.4 means the agent deviated significantly from the expected workflow. Tool call accuracy of 0.3 indicates wrong tools were selected multiple times.
Action: Review planner/routing logic—the agent is choosing wrong execution paths.
Scenario 2: Costs Spiking Unexpectedly
Problem: AI costs jumped 5x with no obvious cause.
// Aggregate metrics reveal inefficiency
aggregate: {
"total_requests": 500,
"average_efficiency": 0.65, // Below 0.7 threshold
"avg_tokens_per_request": 8500, // Previously was 3200
"self_corrections": 3.2 // Average retries per request
}
Finding: Efficiency dropped to 0.65, meaning agents are taking roughly 50% more steps than optimal (1 / 0.65 ≈ 1.5x). Self-corrections averaging 3.2 per request show frequent error-retry loops.
Action: Recent prompt changes likely introduced regressions. Roll back and test.
Scenario 3: Quality Degradation Over Time
Problem: User satisfaction trending down despite no code changes.
// Week-over-week trend analysis
trends: {
"week_1": { completion_rate: 0.94, thumbs_up_ratio: 0.89 },
"week_2": { completion_rate: 0.88, thumbs_up_ratio: 0.82 },
"week_3": { completion_rate: 0.82, thumbs_up_ratio: 0.71 } // ⚠️ Below threshold
}
Finding: Completion rate dropped below 0.9 threshold. Could indicate model updates from provider, data drift, or changing user expectations.
Action: Investigate recent model updates from providers. Check if query types have shifted.
Aggregate Insights
Individual request metrics are useful, but aggregate views reveal system-wide patterns:
// Aggregate metrics across all requests
aggregate_metrics = {
// Volume
"total_requests": 10420,
"total_sessions": 2340,
// Quality
"average_flow_adherence": 0.89,
"average_efficiency": 0.94,
"completion_rate": 0.96,
// Performance
"average_latency_ms": 1250,
"latency_p95_ms": 3200,
// Cost
"total_cost_usd": 892.45,
"avg_cost_per_request": 0.086,
// User satisfaction
"thumbs_up_count": 1248,
"thumbs_down_count": 312,
"net_sentiment": 936
}
How We Implemented This
The metrics described in this post aren’t theoretical—they’re running code. Our proof-of-concept includes a dedicated Agent Metrics Service (port 8007) with ~950 lines of production code implementing all these capabilities.
The Strategic Context: As AI agents move from experiments to production, the operating model must evolve. Traditional process automation frameworks are insufficient for governing probabilistic AI cognition. For a deeper exploration of this shift, see Agentic Operating Model.
Our Implementation: agent-metrics-service
The core engine implements all metrics discussed above. Here’s actual code from
agent-metrics-service/app/core/metrics_engine.py:
# Actual implementation: Model pricing for cost tracking
MODEL_PRICING = {
# OpenAI
"gpt-4": {"prompt": 0.03, "completion": 0.06},
"gpt-4o": {"prompt": 0.005, "completion": 0.015},
"gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006},
# Anthropic Claude
"claude-3-opus": {"prompt": 0.015, "completion": 0.075},
"claude-3.5-sonnet": {"prompt": 0.003, "completion": 0.015},
# Google Gemini
"gemini-1.5-pro": {"prompt": 0.00125, "completion": 0.005},
# Local models (free)
"llama3": {"prompt": 0.0, "completion": 0.0},
"mistral": {"prompt": 0.0, "completion": 0.0},
}
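Given that table, per-request cost is a straightforward calculation. A small usage sketch follows; the helper name is ours for illustration, not necessarily the service’s.

# Illustrative per-request cost calculation over MODEL_PRICING
def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost for one request using the MODEL_PRICING table."""
    pricing = MODEL_PRICING.get(model, {"prompt": 0.0, "completion": 0.0})
    return ((prompt_tokens / 1000) * pricing["prompt"]
            + (completion_tokens / 1000) * pricing["completion"])

# e.g. a gpt-4o call with 2,000 prompt and 500 completion tokens:
# 2.0 * $0.005 + 0.5 * $0.015 = $0.0175
print(estimate_cost("gpt-4o", 2000, 500))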
Flow Adherence: Actual Algorithm
Our implementation uses Levenshtein sequence matching from the python-Levenshtein
library for trajectory comparison:
# From metrics_engine.py - actual flow adherence calculation
from Levenshtein import ratio as levenshtein_ratio

def calculate_flow_adherence(
    expected_trajectory: List[str],
    actual_trajectory: List[str],
    match_type: TrajectoryMatchType = TrajectoryMatchType.IN_ORDER
) -> float:
    """
    Calculate how well actual execution matched expected workflow.
    Supports EXACT, IN_ORDER, and ANY_ORDER matching.
    """
    if match_type == TrajectoryMatchType.EXACT:
        return 1.0 if expected_trajectory == actual_trajectory else 0.0

    elif match_type == TrajectoryMatchType.IN_ORDER:
        # Levenshtein ratio for sequence similarity
        expected_str = " -> ".join(expected_trajectory)
        actual_str = " -> ".join(actual_trajectory)
        return levenshtein_ratio(expected_str, actual_str)

    elif match_type == TrajectoryMatchType.ANY_ORDER:
        # Set intersection for unordered matching
        expected_set = set(expected_trajectory)
        actual_set = set(actual_trajectory)
        intersection = expected_set & actual_set
        return len(intersection) / len(expected_set)
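Calling it with a hypothetical trajectory pair might look like this (the step names are made up for illustration):

# Example call with hypothetical trajectories
score = calculate_flow_adherence(
    expected_trajectory=["plan", "retrieve_docs", "summarize", "respond"],
    actual_trajectory=["plan", "retrieve_docs", "validate", "summarize", "respond"],
    match_type=TrajectoryMatchType.IN_ORDER,
)
# A score near 1.0 means the extra "validate" step barely changed the sequence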
Observability Dashboard
We built an Orchestration Dashboard (ORCHESTRATION_DASHBOARD.html) that visualizes
these metrics in real-time. The dashboard connects to Prometheus metrics exposed by the
orchestration service and displays:
- Request volume and latency: p50, p95, p99 response times
- Model selection distribution: Which models are being used
- Cost tracking: Token usage and USD cost per model
- Semantic cache hit rate: Savings from cached responses
- Quality scores: From LLM-as-Judge evaluations
See It Live: The dashboard is available at
ORCHESTRATION_DASHBOARD.html in the project root. It’s a self-contained HTML
file that polls Prometheus endpoints and renders live metrics using Chart.js.
Service Architecture
The implementation spans multiple microservices, each handling a specific concern:
| Service | Port | What It Does |
|---|---|---|
| agent-metrics-service | 8007 | Flow adherence, efficiency, cost, session tracking |
| eval-service | 8005 | G-Eval, RAG metrics, LLM-as-Judge |
| guardrails-service | 8001 | PII, toxicity, bias, injection detection |
| orchestration-service | 8002 | Prometheus metrics, OpenTelemetry tracing |
Distributed Tracing with OpenTelemetry
Every service implements OpenTelemetry tracing for cross-service request correlation.
From */app/core/tracing.py:
# OpenTelemetry setup (same pattern across all services)
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app, service_name: str):
    resource = Resource.create({SERVICE_NAME: service_name})
    tracer_provider = TracerProvider(resource=resource)

    # Export to Tempo/Jaeger
    otlp_exporter = OTLPSpanExporter(endpoint="tempo:4317")
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Auto-instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)
What you can trace: A single request ID flows through orchestration → guardrails → knowledge → LLM → eval. Each hop is a span with timing and metadata.
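When one service calls the next over HTTP, the trace context has to travel with the request. A minimal sketch of how a downstream hop might propagate it using OpenTelemetry’s standard propagation API; the httpx client, URL, and payload shape are illustrative assumptions.

# Illustrative trace-context propagation on a downstream call
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

async def call_guardrails(payload: dict) -> dict:
    # Start a child span for the downstream hop
    with tracer.start_as_current_span("guardrails.check"):
        headers: dict = {}
        inject(headers)  # adds traceparent/tracestate headers
        async with httpx.AsyncClient() as client:
            resp = await client.post("http://guardrails:8001/check",
                                     json=payload, headers=headers)
        return resp.json()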
Prometheus + Grafana Stack
The POC includes a full observability stack via docker-compose.observability.yml:
# Observability stack components
prometheus: # Scrapes /metrics from all services
grafana: # Dashboards for visualization
tempo: # Distributed trace storage
loki: # Log aggregation
# prometheus.yml - scrape config
scrape_configs:
  - job_name: 'orchestration'
    static_configs:
      - targets: ['orchestration:8002']
  - job_name: 'agent-metrics'
    static_configs:
      - targets: ['agent-metrics:8007']
What You Can See: Request latency p50/p95/p99, token usage by model, cache hit rates, error rates by service—all in real-time Grafana dashboards.
Non-Blocking Integration Pattern
A key architectural decision: metrics collection never blocks the response path. The orchestration service uses fire-and-forget async calls:
# From orchestration-service/app/clients/agent_metrics_client.py
import asyncio
import logging
from typing import Dict

logger = logging.getLogger(__name__)

async def track_workflow(request_id: str, metrics: Dict):
    """Non-blocking metrics submission."""
    try:
        # Fire and forget - don't await completion
        asyncio.create_task(
            _submit_metrics(request_id, metrics)
        )
    except Exception:
        # Metrics failures never impact user response
        logger.warning("Metrics submission failed - continuing")
Comparison: What Commercial Platforms Offer
When researching agent observability, we analyzed commercial platforms like Galileo AI to understand industry best practices. Here’s how different approaches compare:
| Capability | Traditional LLM Eval | Agent Observability |
|---|---|---|
| Response Quality | Hallucination, Toxicity, Relevance | Same + Workflow Context |
| Flow Tracking | None | Full trajectory + adherence scoring |
| Task Completion | Inferred from response | Explicit evaluation with confidence |
| Tool Call Analysis | Not tracked | Selection accuracy + success rates |
| Cost Attribution | Per-request only | By user, session, workflow, model |
| Self-Correction | Not visible | Tracked with success rates |
| Debugging | Response-level only | Step-by-step workflow visibility |
Key Insight: Commercial platforms like Galileo AI pioneered agent-specific metrics because they recognized the same gap we did—traditional LLM evaluation doesn’t tell you if an agent is actually working. Their Luna-2 SLMs enable 100% sampling at sub-200ms latency. The approach here achieves similar capabilities through careful algorithm selection and local processing.
What We’ve Covered So Far
This post completes the core operational architecture. We’ve explored:
- Intelligent Routing: Right model for the right task
- Knowledge Grounding: Responses anchored in verified information
- Guardrails & Evaluation: Safety and quality assurance
- Responsible AI: Explainability and fairness
- Governance: Policy enforcement and audit trails
- Agent Observability: Comprehensive visibility and metrics
But building trust in AI is an ongoing journey, not a destination. The next three posts expand into advanced territory: continuous learning systems that improve without retraining, enterprise-scale patterns for multi-tenant deployments, and translating all of this into business value that executives can measure.
The Complete Trust Picture: Trust in AI isn’t built on hope—it’s built on evidence. With comprehensive observability, you can demonstrate that your AI agents route requests intelligently, ground responses in verified knowledge, respect safety guardrails, provide explainable decisions, comply with governance policies, and perform efficiently and effectively. That’s the foundation—now let’s build on it.
Coming Up Next
In Part 7, we explore Continuous Learning—how AI systems can improve from feedback without expensive fine-tuning cycles, using approaches like the ACE (Agentic Context Engineering) framework.