Building Trust in AI Part 6 of 9

Agent Observability: Understanding How AI Agents Work

Comprehensive metrics for AI agent behavior, performance, and cost tracking.

The previous posts covered how AI requests flow through the system—routing, grounding, guardrails, evaluation, explainability, and governance. But there’s a critical question we haven’t addressed: How do we know if the AI agent is actually working well? This post explores agent observability—the metrics and insights that reveal whether your AI workflows are performing as intended.

The Problem: LLM Metrics Don’t Capture Agent Behavior

Here’s a fundamental insight that drives everything in this post: standard LLM metrics are necessary but not sufficient for AI agents. Traditional evaluation metrics like hallucination detection, toxicity scoring, and answer relevancy measure the quality of individual responses—but agents do much more than generate text.

The Gap: An agent might produce high-quality responses (low hallucination, high relevance) but still fail at its actual task because it selected the wrong tools, deviated from the expected workflow, took 10 steps instead of 3, or never actually completed what was asked. Standard LLM evaluation misses all of this.

When we analyzed commercial platforms like Galileo AI against our implementation, we discovered a critical gap: we had strong governance, excellent guardrails, and comprehensive evaluation—but we lacked agent-specific observability. Our platform was 90% complete, missing the 10% that matters most for understanding agent behavior.

What Standard LLM Metrics Miss

  • Flow adherence: Did the agent follow the intended workflow?
  • Task completion: Did the agent actually finish what was asked?
  • Tool selection quality: Did it pick the right tools for the job?
  • Efficiency: How many steps did it take vs. optimal?
  • Multi-step reasoning: How does it perform across complex workflows?
  • Self-correction: Can it recover from errors?

Why Agent Observability Matters

AI agents are complex systems. Unlike traditional software where inputs directly produce outputs, AI agents make decisions, select tools, reason through multi-step problems, and sometimes correct themselves mid-execution. Without observability, these workflows are black boxes.

The Challenge: How do you debug an AI agent that took 12 steps instead of 3? Or identify why costs spiked 5x on certain queries? Or determine if an agent is learning bad habits? You need visibility into the agent’s decision-making process itself.

Seven Critical Dimensions of Agent Observability

Production LLM monitoring requires going far beyond simple metrics. Based on industry research and real-world experience, comprehensive agent observability spans seven dimensions:

  1. Request Tracing & Lineage: Understand the FULL journey of each request through the system
  2. Token Usage & Cost Attribution: AI costs can spiral without granular tracking by user, team, model
  3. Prompt Engineering & Experimentation: Version control and A/B testing for prompts
  4. Quality & Correctness Metrics: Groundedness, relevance, hallucination detection
  5. Performance & Latency: End-to-end timing, component latencies, time-to-first-token
  6. Error Tracking & Debugging: Full context for rapid diagnosis and resolution
  7. Business Metrics & ROI: Connect AI usage to business outcomes

Agent observability answers questions like:

  • Is the agent following expected workflows? Flow adherence metrics
  • Is it being efficient? Steps taken vs. optimal path
  • Is it completing tasks successfully? Task completion evaluation
  • What does it cost to run? Token usage and cost tracking
  • Is it improving over time? Session-level trends
  • Can it recover from errors? Self-correction detection

Trust Insight: Agent observability builds trust by making AI behavior visible and measurable. When you can see exactly what an agent did, why it did it, and how well it performed, you move from hoping the AI works to knowing it works.

Core Metrics Architecture

The Agent Metrics system tracks multiple dimensions of agent behavior:

Agent Metrics Pipeline
────────────────────────

Agent Execution
      │
      ▼
┌──────────────────────────────────────────┐
│              METRICS ENGINE              │
│                                          │
│  1. Flow Adherence                       │
│     └─▶ Trajectory matching algorithms   │
│  2. Task Completion                      │
│     └─▶ Heuristic + LLM-as-Judge         │
│  3. Token & Cost Tracking                │
│     └─▶ Per-model pricing calculations   │
│  4. Session Aggregation                  │
│     └─▶ Multi-turn conversation metrics  │
│  5. Reasoning Analysis                   │
│     └─▶ Step-by-step trace evaluation    │
│  6. Self-Correction Detection            │
│     └─▶ Error recovery tracking          │
└──────────────────────────────────────────┘
      │
      ▼
Dashboards, Alerts, Reports

1. Flow Adherence: Did the Agent Follow the Plan?

When an AI agent receives a task, there’s typically an expected workflow—a sequence of steps that should be followed. Flow adherence measures how well the actual execution matched expectations.

Trajectory Matching

The system supports multiple matching strategies depending on how strict you need the comparison:

Exact Match

Steps must match perfectly in order and content

score = (expected == actual) ? 1.0 : 0.0

In-Order Match

Expected steps must appear in order, extra steps allowed

score = matched_steps / expected_steps

Any-Order Match

All expected actions present, order doesn’t matter

score = intersection / expected_set

Levenshtein Ratio

Fuzzy matching based on sequence similarity

score = levenshtein_ratio(expected, actual)
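
For intuition, here's a minimal scoring sketch for the in-order and any-order strategies (simplified, not the production trajectory matcher shown later in this post):

# Simplified scoring sketches for two of the strategies above; not production code.
def in_order_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected steps appearing in 'actual' in the right order."""
    matched, pos = 0, 0
    for step in expected:
        while pos < len(actual) and actual[pos] != step:
            pos += 1                  # skip extra steps the agent inserted
        if pos < len(actual):
            matched += 1
            pos += 1                  # move past the matched step
    return matched / len(expected)

def any_order_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected steps present anywhere in the actual trajectory."""
    return len(set(expected) & set(actual)) / len(set(expected))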

Visualizing Flow Adherence

Example: Document Processing Workflow

Step   Action             Status
1      parse_document     Expected & Matched
2      extract_entities   Expected & Matched
+      validate_format    Extra step (agent added)
3      generate_summary   Expected & Matched
4      store_results      Expected & Matched

Flow Adherence: 0.89 (IN_ORDER) | Agent added validation step

2. Agent Efficiency: Steps vs. Optimal Path

Efficiency measures whether the agent is taking the shortest reasonable path to complete a task. An efficient agent uses fewer resources while achieving the same outcome.

// Efficiency calculation
efficiency = optimal_steps / actual_steps

// Interpretation:
// > 1.0 = Agent was more efficient than expected
// = 1.0 = Perfect efficiency
// < 1.0 = Agent took extra steps

Why Efficiency Matters: In AI systems, every step consumes tokens, and tokens cost money. An agent that takes 10 steps instead of 3 isn’t just slower; it’s more than three times as expensive. Tracking efficiency helps identify agents that need optimization or prompts that cause excessive reasoning loops.

3. Tool Call Accuracy

Agents often have access to multiple tools (APIs, databases, external services). Tool call accuracy measures two things:

  1. Selection Accuracy: Did the agent choose the right tools?
  2. Success Rate: Did those tool calls succeed?

// Combined accuracy calculation
tool_selection_accuracy = correct_tools / expected_tools
success_rate = successful_calls / total_calls

// Final accuracy (weighted average)
accuracy = (success_rate * 0.5) + (tool_selection_accuracy * 0.5)
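
A minimal sketch of that combined score, using the equal 0.5/0.5 weighting from the formula above (the function signature is illustrative):

# Combined tool-call accuracy sketch; the signature is illustrative.
def tool_call_accuracy(expected_tools: set[str], used_tools: set[str],
                       successful_calls: int, total_calls: int) -> float:
    selection = len(expected_tools & used_tools) / len(expected_tools)
    success_rate = successful_calls / total_calls if total_calls else 0.0
    # Weighted average of success rate and selection accuracy
    return (success_rate * 0.5) + (selection * 0.5)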

4. Task Completion: Did It Actually Work?

The ultimate question: did the agent accomplish what was asked? Task completion evaluation uses multiple signals.

Heuristic Evaluation

# Error patterns (reduce confidence)
ERROR_PATTERNS = ["error", "failed", "exception",
                  "could not", "unable to", "timeout"]

# Success patterns (increase confidence)
SUCCESS_PATTERNS = ["completed", "successfully", "done",
                    "finished", "achieved"]

def completion_confidence(response: str, user_approved: bool = False) -> float:
    """Heuristic confidence that the task was actually completed."""
    text = response.lower()
    has_success = any(p in text for p in SUCCESS_PATTERNS)
    has_error = any(p in text for p in ERROR_PATTERNS)
    has_substantial_response = len(text) > 100  # length threshold is illustrative

    # Weighted combination of the signals above
    return (
        has_success * 0.4 +
        (not has_error) * 0.3 +
        has_substantial_response * 0.2 +
        user_approved * 0.1
    )

LLM-as-Judge: AI Evaluating AI

For more nuanced evaluation, the system can use an LLM to judge task completion quality:

LLM-as-Judge Evaluation Criteria

Criterion         Question                                           Scale
Task Completion   Did the agent fully complete the requested task?   0-10
Accuracy          Is the response accurate and correct?              0-10
Relevance         Is the response relevant to the task?              0-10
Completeness      Did the agent address all aspects?                 0-10
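
To make this concrete, here's a minimal sketch of how such a judging call could be wired up. The prompt wording, the choice of gpt-4o as the judge model, and the JSON handling are illustrative assumptions, not the eval-service's exact implementation.

# LLM-as-Judge sketch; judge model, prompt wording, and JSON schema are
# illustrative assumptions, not the service's exact code.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are evaluating an AI agent's work.
Task: {task}
Agent response: {response}

Score each criterion from 0-10 and reply as JSON with keys:
task_completion, accuracy, relevance, completeness."""

def judge_completion(task: str, response: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a judge LLM to score the agent's output on the four criteria."""
    result = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)

Because the judge returns structured scores, they can be stored alongside the heuristic confidence and compared over time.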

5. Token Usage and Cost Tracking

AI costs money. Every prompt token, every completion token has a price. Tracking costs at the request, session, and aggregate level enables budget management and optimization.

Model Pricing Reference

Model                           Prompt (per 1K tokens)   Completion (per 1K tokens)
GPT-4                           $0.030                   $0.060
GPT-4o                          $0.005                   $0.015
Claude 3 Opus                   $0.015                   $0.075
Claude 3.5 Sonnet               $0.003                   $0.015
Gemini 1.5 Pro                  $0.00125                 $0.005
Local Models (Llama, Mistral)   $0.00                    $0.00

Cost Optimization Strategy: The metrics system enables intelligent model selection. Simple queries can route to cheaper models (GPT-4o-mini, local models) while complex tasks use premium models. This can reduce costs by 70-90% without impacting quality on appropriate tasks.
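
As a sketch, a cost-aware routing decision could look like this (the thresholds and model names are illustrative assumptions, not the production routing policy):

# Cost-aware model selection sketch; thresholds and model names are
# illustrative assumptions.
def select_model(complexity: float) -> str:
    """Route simple queries to cheap models and hard ones to premium models."""
    if complexity < 0.3:
        return "llama3"        # local model, effectively free
    if complexity < 0.7:
        return "gpt-4o-mini"   # inexpensive hosted model
    return "gpt-4o"            # premium model for complex tasks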

6. Session Tracking: Multi-Turn Conversations

Real AI interactions aren’t single requests—they’re conversations. Session tracking aggregates metrics across multiple turns.

Session Metrics Over Multiple Turns

Turn            Tokens   Cost     Latency
Turn 1          1,234    $0.024   890 ms
Turn 2          2,156    $0.041   1,240 ms
Turn 3          1,890    $0.036   1,100 ms
Session Total   5,280    $0.101   1,077 ms (avg)
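
A minimal sketch of how those session totals can be aggregated from per-turn records (field names are illustrative, not the service's exact schema):

# Session-level aggregation sketch; field names are illustrative assumptions.
from statistics import mean

turns = [
    {"tokens": 1234, "cost_usd": 0.024, "latency_ms": 890},
    {"tokens": 2156, "cost_usd": 0.041, "latency_ms": 1240},
    {"tokens": 1890, "cost_usd": 0.036, "latency_ms": 1100},
]

session_totals = {
    "total_tokens": sum(t["tokens"] for t in turns),                # 5,280
    "total_cost_usd": round(sum(t["cost_usd"] for t in turns), 3),  # 0.101
    "avg_latency_ms": round(mean(t["latency_ms"] for t in turns)),  # 1,077
}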

7. Reasoning Analysis

Modern AI agents use chain-of-thought reasoning to solve complex problems. The metrics system tracks each reasoning step:

// Reasoning step analysis
reasoning_analysis = {
    "total_steps": 5,
    "total_tokens": 3200,
    "total_latency_ms": 4500,
    "avg_tokens_per_step": 640,
    "avg_latency_per_step": 900,
    "actions_taken": 3,
    "observations_received": 4,
    "action_rate": 0.6  // 60% of steps involved actions
}
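
A sketch of how such a summary could be derived from a recorded trace (the step fields assumed below are illustrative):

# Derive the reasoning summary from a step trace; the step fields
# ("type", "tokens", "latency_ms") are illustrative assumptions.
def analyze_reasoning(steps: list[dict]) -> dict:
    actions = [s for s in steps if s.get("type") == "action"]
    observations = [s for s in steps if s.get("type") == "observation"]
    total_tokens = sum(s.get("tokens", 0) for s in steps)
    total_latency = sum(s.get("latency_ms", 0) for s in steps)
    n = max(len(steps), 1)
    return {
        "total_steps": len(steps),
        "total_tokens": total_tokens,
        "total_latency_ms": total_latency,
        "avg_tokens_per_step": total_tokens / n,
        "avg_latency_per_step": total_latency / n,
        "actions_taken": len(actions),
        "observations_received": len(observations),
        "action_rate": len(actions) / n,
    }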

8. Self-Correction Detection

Good agents recognize and recover from errors. The metrics system tracks self-correction events:

// Self-correction analysis
correction_analysis = {
    "total_corrections": 2,
    "successful_corrections": 2,
    "correction_success_rate": 1.0,
    "failed_corrections": 0,
    "error_types": {
        "Invalid API response format": 1,
        "Missing required field": 1
    }
}
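
One simple way to detect a correction is an error step followed by a successful retry of the same action. The sketch below assumes that step structure; the production detector may use different signals.

# Self-correction detection sketch; the step fields ("action", "status",
# "error_type") are illustrative assumptions.
def detect_corrections(steps: list[dict]) -> dict:
    corrections = []
    for i, step in enumerate(steps):
        if step.get("status") != "error":
            continue
        # Did any later step retry the same action and succeed?
        recovered = any(
            later.get("action") == step.get("action")
            and later.get("status") == "success"
            for later in steps[i + 1:]
        )
        corrections.append({"error_type": step.get("error_type"),
                            "recovered": recovered})

    successful = sum(1 for c in corrections if c["recovered"])
    return {
        "total_corrections": len(corrections),
        "successful_corrections": successful,
        "failed_corrections": len(corrections) - successful,
        "correction_success_rate": successful / max(len(corrections), 1),
    }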

Trust Insight: An agent that can identify errors and correct them builds trust. Self-correction metrics show whether your AI is resilient (catches and fixes issues) or brittle (fails on edge cases). A high correction success rate indicates a robust agent.

Real-World Debugging Scenarios

Agent observability isn’t just about collecting metrics—it’s about solving real problems. Here are common debugging scenarios where these metrics prove their value:

Scenario 1: Agent Fails to Complete Task

Problem: Users report that the AI “doesn’t finish” or gives incomplete answers.

// Metrics reveal the root cause
request_id: "failed_req_123"

workflow_metrics: {
    "flow_adherence": 0.4,      // Deviated significantly from plan
    "agent_efficiency": 0.6,    // Took too many steps
    "tool_call_accuracy": 0.3   // Wrong tools selected
}

completion_metrics: {
    "completed": false,
    "confidence": 0.2,
    "failure_reason": "Error detected in response"
}

Finding: Flow adherence of 0.4 means the agent deviated significantly from the expected workflow. Tool call accuracy of 0.3 indicates wrong tools were selected multiple times.

Action: Review planner/routing logic—the agent is choosing wrong execution paths.

Scenario 2: Costs Spiking Unexpectedly

Problem: AI costs jumped 5x with no obvious cause.

// Aggregate metrics reveal inefficiency
aggregate: {
    "total_requests": 500,
    "average_efficiency": 0.65,    // Below 0.7 threshold
    "avg_tokens_per_request": 8500, // Previously was 3200
    "self_corrections": 3.2        // Average retries per request
}

Finding: Efficiency dropped to 0.65, meaning agents are taking ~35% more steps than optimal. Self-corrections averaging 3.2 per request shows frequent error-retry loops.

Action: Recent prompt changes likely introduced regressions. Roll back and test.

Scenario 3: Quality Degradation Over Time

Problem: User satisfaction trending down despite no code changes.

// Week-over-week trend analysis
trends: {
    "week_1": { completion_rate: 0.94, thumbs_up_ratio: 0.89 },
    "week_2": { completion_rate: 0.88, thumbs_up_ratio: 0.82 },
    "week_3": { completion_rate: 0.82, thumbs_up_ratio: 0.71 } // ⚠️ Below threshold
}

Finding: Completion rate dropped below the 0.9 threshold. This could indicate provider-side model updates, data drift, or changing user expectations.

Action: Investigate recent model updates from providers. Check if query types have shifted.

Aggregate Insights

Individual request metrics are useful, but aggregate views reveal system-wide patterns:

// Aggregate metrics across all requests
aggregate_metrics = {
    // Volume
    "total_requests": 10420,
    "total_sessions": 2340,

    // Quality
    "average_flow_adherence": 0.89,
    "average_efficiency": 0.94,
    "completion_rate": 0.96,

    // Performance
    "average_latency_ms": 1250,
    "latency_p95_ms": 3200,

    // Cost
    "total_cost_usd": 892.45,
    "avg_cost_per_request": 0.086,

    // User satisfaction
    "thumbs_up_count": 1248,
    "thumbs_down_count": 312,
    "net_sentiment": 936
}

How We Implemented This

The metrics described in this post aren’t theoretical—they’re running code. Our proof-of-concept includes a dedicated Agent Metrics Service (port 8007) with ~950 lines of production code implementing all these capabilities.

The Strategic Context: As AI agents move from experiments to production, the operating model must evolve. Traditional process automation frameworks are insufficient for governing probabilistic AI cognition. For a deeper exploration of this shift, see Agentic Operating Model.

Our Implementation: agent-metrics-service

The core engine implements all metrics discussed above. Here’s actual code from agent-metrics-service/app/core/metrics_engine.py:

# Actual implementation: Model pricing for cost tracking
MODEL_PRICING = {
    # OpenAI
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-4o": {"prompt": 0.005, "completion": 0.015},
    "gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006},
    # Anthropic Claude
    "claude-3-opus": {"prompt": 0.015, "completion": 0.075},
    "claude-3.5-sonnet": {"prompt": 0.003, "completion": 0.015},
    # Google Gemini
    "gemini-1.5-pro": {"prompt": 0.00125, "completion": 0.005},
    # Local models (free)
    "llama3": {"prompt": 0.0, "completion": 0.0},
    "mistral": {"prompt": 0.0, "completion": 0.0},
}
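
A minimal sketch of how per-request cost can be derived from that table (the helper name and signature are illustrative, not necessarily the engine's exact API):

# Per-request cost sketch using the MODEL_PRICING table above; the function
# name and signature are illustrative assumptions.
def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return USD cost for one request; unknown models default to free."""
    pricing = MODEL_PRICING.get(model, {"prompt": 0.0, "completion": 0.0})
    return (prompt_tokens / 1000) * pricing["prompt"] \
         + (completion_tokens / 1000) * pricing["completion"]

# Example: a gpt-4o call with 1,234 prompt tokens and 500 completion tokens
# costs roughly 1.234 * 0.005 + 0.5 * 0.015 ≈ $0.014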

Flow Adherence: Actual Algorithm

Our implementation uses Levenshtein sequence matching from the python-Levenshtein library for trajectory comparison:

# From metrics_engine.py - actual flow adherence calculation
from typing import List

from Levenshtein import ratio as levenshtein_ratio

def calculate_flow_adherence(
    expected_trajectory: List[str],
    actual_trajectory: List[str],
    match_type: TrajectoryMatchType = TrajectoryMatchType.IN_ORDER
) -> float:
    """
    Calculate how well actual execution matched expected workflow.
    Supports EXACT, IN_ORDER, and ANY_ORDER matching.
    """
    if match_type == TrajectoryMatchType.EXACT:
        # All-or-nothing: identical step sequence required
        return 1.0 if expected_trajectory == actual_trajectory else 0.0

    elif match_type == TrajectoryMatchType.IN_ORDER:
        # Levenshtein ratio for sequence similarity
        expected_str = " -> ".join(expected_trajectory)
        actual_str = " -> ".join(actual_trajectory)
        return levenshtein_ratio(expected_str, actual_str)

    elif match_type == TrajectoryMatchType.ANY_ORDER:
        # Set intersection for unordered matching
        expected_set = set(expected_trajectory)
        actual_set = set(actual_trajectory)
        intersection = expected_set & actual_set
        return len(intersection) / len(expected_set)

    raise ValueError(f"Unsupported match type: {match_type}")
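
Applied to the document-processing example from earlier, usage looks roughly like this (the resulting score is approximate):

# Usage sketch with the document-processing trajectories shown earlier
expected = ["parse_document", "extract_entities", "generate_summary", "store_results"]
actual = ["parse_document", "extract_entities", "validate_format",
          "generate_summary", "store_results"]

score = calculate_flow_adherence(expected, actual, TrajectoryMatchType.IN_ORDER)
# One inserted step leaves the joined strings mostly similar: roughly 0.88-0.89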

Observability Dashboard

We built an Orchestration Dashboard (ORCHESTRATION_DASHBOARD.html) that visualizes these metrics in real-time. The dashboard connects to Prometheus metrics exposed by the orchestration service and displays:

  • Request volume and latency: p50, p95, p99 response times
  • Model selection distribution: Which models are being used
  • Cost tracking: Token usage and USD cost per model
  • Semantic cache hit rate: Savings from cached responses
  • Quality scores: From LLM-as-Judge evaluations

See It Live: The dashboard is available at ORCHESTRATION_DASHBOARD.html in the project root. It’s a self-contained HTML file that polls Prometheus endpoints and renders live metrics using Chart.js.

Service Architecture

The implementation spans multiple microservices, each handling a specific concern:

Service                  Port   What It Does
agent-metrics-service    8007   Flow adherence, efficiency, cost, session tracking
eval-service             8005   G-Eval, RAG metrics, LLM-as-Judge
guardrails-service       8001   PII, toxicity, bias, injection detection
orchestration-service    8002   Prometheus metrics, OpenTelemetry tracing

Distributed Tracing with OpenTelemetry

Every service implements OpenTelemetry tracing for cross-service request correlation. From */app/core/tracing.py:

# OpenTelemetry setup (same pattern across all services)
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app, service_name: str):
    resource = Resource.create({SERVICE_NAME: service_name})
    tracer_provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(tracer_provider)

    # Export spans to Tempo/Jaeger over OTLP gRPC
    otlp_exporter = OTLPSpanExporter(endpoint="tempo:4317")
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

    # Auto-instrument FastAPI request handling
    FastAPIInstrumentor.instrument_app(app)

What you can trace: A single request ID flows through orchestration → guardrails → knowledge → LLM → eval. Each hop is a span with timing and metadata.
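
For hops that aren't auto-instrumented, a manual span can wrap a single step. This is a minimal sketch; the span name, attribute keys, and guardrails_client call are hypothetical.

# Manual span sketch around one pipeline hop; span/attribute names and the
# guardrails_client call are hypothetical.
from opentelemetry import trace

tracer = trace.get_tracer("orchestration-service")

async def run_guardrails(request_id: str, text: str):
    with tracer.start_as_current_span("guardrails.check") as span:
        span.set_attribute("request.id", request_id)
        result = await guardrails_client.check(text)   # hypothetical client call
        span.set_attribute("guardrails.passed", result.passed)
        return result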

Prometheus + Grafana Stack

The POC includes a full observability stack via docker-compose.observability.yml:

# Observability stack components
prometheus:    # Scrapes /metrics from all services
grafana:       # Dashboards for visualization
tempo:         # Distributed trace storage
loki:          # Log aggregation

# prometheus.yml - scrape config
scrape_configs:
  - job_name: 'orchestration'
    static_configs:
      - targets: ['orchestration:8002']
  - job_name: 'agent-metrics'
    static_configs:
      - targets: ['agent-metrics:8007']

What You Can See: Request latency p50/p95/p99, token usage by model, cache hit rates, error rates by service—all in real-time Grafana dashboards.
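
On the service side, exposing these metrics is straightforward with prometheus_client (the metric names below are illustrative, not the services' exact metric names):

# Exposing Prometheus metrics from a FastAPI service; metric names here are
# illustrative assumptions.
from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter("agent_requests_total", "Total agent requests", ["model"])
LATENCY = Histogram("agent_request_latency_seconds", "End-to-end request latency")

metrics_app = make_asgi_app()
# Mounted on the FastAPI app so Prometheus can scrape it:
#   app.mount("/metrics", metrics_app)

# Recording a request:
#   REQUESTS.labels(model="gpt-4o").inc()
#   LATENCY.observe(1.25)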

Non-Blocking Integration Pattern

A key architectural decision: metrics collection never blocks the response path. The orchestration service uses fire-and-forget async calls:

# From orchestration-service/app/clients/agent_metrics_client.py
import asyncio
import logging
from typing import Dict

logger = logging.getLogger(__name__)

async def track_workflow(request_id: str, metrics: Dict):
    """Non-blocking metrics submission."""
    try:
        # Fire and forget - don't await completion
        asyncio.create_task(
            _submit_metrics(request_id, metrics)
        )
    except Exception:
        # Metrics failures never impact the user response
        logger.warning("Metrics submission failed - continuing")

Comparison: What Commercial Platforms Offer

When researching agent observability, we analyzed commercial platforms like Galileo AI to understand industry best practices. Here’s how different approaches compare:

Capability           Traditional LLM Eval                 Agent Observability
Response Quality     Hallucination, Toxicity, Relevance   Same + Workflow Context
Flow Tracking        None                                 Full trajectory + adherence scoring
Task Completion      Inferred from response               Explicit evaluation with confidence
Tool Call Analysis   Not tracked                          Selection accuracy + success rates
Cost Attribution     Per-request only                     By user, session, workflow, model
Self-Correction      Not visible                          Tracked with success rates
Debugging            Response-level only                  Step-by-step workflow visibility

Key Insight: Commercial platforms like Galileo AI pioneered agent-specific metrics because they recognized the same gap we did—traditional LLM evaluation doesn’t tell you if an agent is actually working. Their Luna-2 SLMs enable 100% sampling at sub-200ms latency. The approach here achieves similar capabilities through careful algorithm selection and local processing.

What We’ve Covered So Far

This post completes the core operational architecture. We’ve explored:

  1. Intelligent Routing: Right model for the right task
  2. Knowledge Grounding: Responses anchored in verified information
  3. Guardrails & Evaluation: Safety and quality assurance
  4. Responsible AI: Explainability and fairness
  5. Governance: Policy enforcement and audit trails
  6. Agent Observability: Comprehensive visibility and metrics

But building trust in AI is an ongoing journey, not a destination. The next three posts expand into advanced territory: continuous learning systems that improve without retraining, enterprise-scale patterns for multi-tenant deployments, and translating all of this into business value that executives can measure.

The Complete Trust Picture: Trust in AI isn’t built on hope—it’s built on evidence. With comprehensive observability, you can demonstrate that your AI agents route requests intelligently, ground responses in verified knowledge, respect safety guardrails, provide explainable decisions, comply with governance policies, and perform efficiently and effectively. That’s the foundation—now let’s build on it.

Coming Up Next

In Part 7, we explore Continuous Learning—how AI systems can improve from feedback without expensive fine-tuning cycles, using approaches like the ACE (Agentic Context Engineering) framework.
