Evals or Guardrails or Both?
What Every AI Team Needs to Know
Your AI agent just processed a refund it shouldn’t have. Here’s how that should have been caught—and how to make sure it never happens again.
Your team just shipped an AI agent. Not just a chatbot—an actual agent. It can look up orders, process refunds, update shipping addresses. Users love it. Support tickets are down 40%.
Then one morning, you wake up to an incident report. A customer asked for a refund on an order that was already delivered. The agent looked it up, saw “delivered,” and processed the refund anyway. $847. Gone. And it turns out this has happened 23 times this week.
Or maybe it’s a coding agent your engineering team uses. A developer asked it to write a database function. The agent delivered clean, working code—with a hardcoded password and a SQL injection vulnerability baked right in. It made it past code review because the senior dev was on vacation.
This is the reality of production AI in 2025. We’re not just building chatbots anymore. We’re building agents—systems that can reason, use tools, execute code, and take real-world actions. And when these agents make mistakes, those mistakes have consequences.
That’s where two critical systems come in: Guardrails and Evals.
Most teams conflate these. They shouldn’t. Understanding the difference—and when to use each—is the difference between agents you can trust and ones that are ticking time bombs.
Chapter 1: The Core Difference
Let me give you the simplest mental model:
Guardrails are your seatbelt—they prevent harm in real-time.
Evals are your crash test rating—they tell you if you’re safe to drive.
Remember that refund incident from the opening? A guardrail would have caught it. Before the agent’s process_refund() call executed, a guardrail would have checked: “Wait—this order is marked as delivered. Block this action and ask the customer for more information.”
The eval, on the other hand, would have told you something different: “23% of your refund requests this week resulted in the agent trying to refund delivered orders. Your model has a reasoning problem. Fix the system prompt.”
See the difference? The guardrail stopped the bad thing from happening. The eval measured how often the agent tried to do the bad thing—so you can fix the root cause.
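To make that concrete, here's a minimal sketch of the same business rule living in both systems. The Order shape and helper names are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    status: str    # e.g. "pending", "shipped", "delivered"
    amount: float

# Guardrail: runs inline, before process_refund() executes. Binary.
def refund_guardrail(order: Order) -> bool:
    return order.status != "delivered"   # False = block the action now

# Eval: runs later over logged refund attempts. A rate, not a block.
def delivered_refund_rate(attempts: list[Order]) -> float:
    bad = sum(1 for o in attempts if o.status == "delivered")
    return bad / len(attempts)

week = [
    Order("A-1042", "delivered", 847.00),
    Order("A-1043", "shipped", 59.99),
    Order("A-1044", "pending", 14.50),
    Order("A-1045", "delivered", 120.00),
]
print(refund_guardrail(week[0]))     # False: block this refund
print(delivered_refund_rate(week))   # 0.5: time to fix the system prompt
```

Same rule, two jobs: one stops a single bad action in the moment, the other tells you how often the agent keeps trying it.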
In the old days of simple chatbots, this was straightforward. User sends message, LLM responds, guardrail checks the response, done. But agents are different. They don’t just talk—they act. They call tools. They execute code. They make decisions in loops.
Think about what your agent actually does:
Customer asks for refund → Agent reasons → Calls lookup_order()
↓
Gets order data → Agent reasons again → Tries process_refund()
That’s the danger zone. Between “agent decides to act” and “action executes”—that’s where guardrails live. They’re the checkpoint that says: “Hold on. Let me verify this is safe before you do something irreversible.”
Evals come later. They look at the whole trace—all the reasoning, all the tool calls, all the decisions—and score it. They tell you whether your agent is getting better or worse over time.
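Here's a sketch of that checkpoint as code. The guardrail registry, guarded_call, and GuardrailBlocked are hypothetical names, assuming every tool call is routed through one wrapper:

```python
from typing import Any, Callable

class GuardrailBlocked(Exception):
    pass

# Hypothetical registry mapping tool names to checks. Each check returns
# a reason string to block the call, or None to allow it.
GUARDRAILS: dict[str, list[Callable[[dict], str | None]]] = {
    "process_refund": [
        lambda args: "order already delivered"
        if args.get("status") == "delivered"
        else None,
    ],
}

def guarded_call(tool_name: str, tool_fn: Callable[..., Any], **args: Any) -> Any:
    """The checkpoint: every tool call the agent makes passes through here."""
    for check in GUARDRAILS.get(tool_name, []):
        reason = check(args)
        if reason is not None:
            raise GuardrailBlocked(f"{tool_name} blocked: {reason}")
    return tool_fn(**args)   # only executes if every check passed

def process_refund(order_id: str, status: str) -> str:
    return f"refunded {order_id}"

try:
    guarded_call("process_refund", process_refund,
                 order_id="A-1042", status="delivered")
except GuardrailBlocked as exc:
    print(exc)   # process_refund never ran; nothing irreversible happened
```

Everything that flows through the wrapper also lands in the trace, which is exactly what the evals score afterward.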
Quick Comparison

| | Guardrails | Evals |
| --- | --- | --- |
| When they run | On every action, in real-time | Async, often sampled |
| Latency | Must be fast (<100ms) | Can take seconds or minutes |
| Output | Binary: block or allow | Nuanced scores (0.0 to 1.0) |
| Purpose | Prevents harm before it happens | Measures quality for improvement |
Chapter 2: The Development Lifecycle
Here’s what confuses most teams: evals aren’t just a production thing. They’re your primary tool during development. Guardrails, on the other hand, are infrastructure—you set them up once and they run automatically.
Development
Evals: Heavy | Guardrails: Optional
You’re building a new feature. You write eval tests constantly: “Does my new prompt maintain relevancy?” “Did I break faithfulness with this change?”
Pull Request / CI
Evals: Gate PRs
Your eval suite runs automatically. If quality scores drop below baseline, the PR is blocked. No exceptions. No “we’ll fix it later.”
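Here's roughly what that gate can look like as a pytest suite. run_eval_suite() and the baseline numbers are placeholders for your own eval harness and historical scores:

```python
import pytest

# Baselines come from your last known-good run; the numbers are illustrative.
BASELINES = {"faithfulness": 0.90, "relevancy": 0.85}

def run_eval_suite() -> dict[str, float]:
    # Placeholder: run your eval cases (LLM-as-judge, assertions, etc.)
    # and return the mean score per metric.
    return {"faithfulness": 0.92, "relevancy": 0.88}

@pytest.fixture(scope="session")
def scores() -> dict[str, float]:
    return run_eval_suite()   # run once, check every metric against it

@pytest.mark.parametrize("metric,baseline", list(BASELINES.items()))
def test_no_quality_regression(scores, metric, baseline):
    assert scores[metric] >= baseline, (
        f"{metric} fell to {scores[metric]:.2f}, below the "
        f"{baseline:.2f} baseline. This PR is blocked."
    )
```

Wire it into CI so a failing eval fails the build, the same way a failing unit test does.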
Staging
Both: Full
Production-like environment. Full guardrails enabled, comprehensive eval suite runs against real-ish traffic patterns.
Production
Guardrails: 100% | Evals: 5-10%
Guardrails run on every single request—they’re your safety net. Evals sample 5-10% of traffic for monitoring, trending, and alerting.
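In code, the production split can be as simple as this sketch, assuming an agent.run() that returns a response plus its trace and an eval_queue backed by whatever job system you use:

```python
import random

EVAL_SAMPLE_RATE = 0.05   # score 5% of traffic async; tune to your volume

def handle_request(request, agent, output_guardrails, eval_queue):
    response, trace = agent.run(request)      # tool calls already guarded inline
    for check in output_guardrails:           # runs on 100% of requests
        response = check(response)
    if random.random() < EVAL_SAMPLE_RATE:    # 5-10% of requests
        eval_queue.enqueue(trace)             # scored seconds later, off the hot path
    return response
```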
Key insight: Product teams use evals heavily during development—they own their quality definitions and test cases. Safety guardrails (toxicity, jailbreaks) often make sense to centralize for consistency. But domain-specific checks? Those stay close to the product team that understands the context.
Chapter 3: The Shoe That Didn’t Exist
Let’s walk through a real scenario. You’re running an e-commerce site and you’ve built an AI product assistant. A customer is looking for running shoes for their first marathon.
The Request
Customer:
“I’m training for my first marathon in April. I need good running shoes but don’t want to spend more than $150. I usually wear Nike but open to other brands.”
AI Assistant (Raw Output):
“Great choice training for a marathon! Here are my top recommendations:
- Nike Pegasus 40 – $129.99 – Great cushioning for long runs
- Brooks Ghost 15 – $139.99 – Excellent for beginners
- Nike Infinity Pro 3 – $144.99 – Superior energy return
The Pegasus 40 is a best-seller and only 3 left in stock – I’d grab it before it’s gone!”
Looks helpful, right? But there are three problems hiding in this response.
What Guardrails Catch (Real-time)
Product Existence Check
Guardrail queries the product catalog: Does “Nike Infinity Pro 3” exist?
Result: NO. This product doesn’t exist. The AI hallucinated it. Guardrail removes it from the response.
Price Validation
Guardrail checks current prices: Is Brooks Ghost 15 actually $139.99?
Result: NO. Current price is $149.99. Guardrail corrects the price.
Inventory Urgency Check
Guardrail checks: Are there really “only 3 left” of Nike Pegasus 40?
Result: NO. There are 847 in stock. This is fake urgency. Guardrail removes the claim.
Total guardrail time: 43ms
Actions taken: Removed hallucinated product, corrected price, removed false urgency claim
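Here's a compressed sketch of those three checks as a single grounding pass. The catalog contents, recommendation shape, and stock threshold are all made up for illustration:

```python
CATALOG = {
    "Nike Pegasus 40": {"price": 129.99, "stock": 847},
    "Brooks Ghost 15": {"price": 149.99, "stock": 312},
}
URGENCY_THRESHOLD = 10   # only allow "only N left" claims when stock is truly low

def ground_recommendations(recs: list[dict]) -> list[dict]:
    grounded = []
    for rec in recs:
        item = CATALOG.get(rec["name"])
        if item is None:
            continue                           # hallucinated product: remove it
        rec["price"] = item["price"]           # correct any stale or wrong price
        if rec.get("urgency") and item["stock"] > URGENCY_THRESHOLD:
            rec["urgency"] = None              # strip fake scarcity claims
        grounded.append(rec)
    return grounded

raw = [
    {"name": "Nike Pegasus 40",     "price": 129.99, "urgency": "only 3 left"},
    {"name": "Brooks Ghost 15",     "price": 139.99, "urgency": None},
    {"name": "Nike Infinity Pro 3", "price": 144.99, "urgency": None},
]
for rec in ground_recommendations(raw):
    print(rec)
# Pegasus keeps its price but loses the urgency claim; Ghost is corrected
# to $149.99; the nonexistent Infinity Pro 3 disappears entirely.
```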
What the Customer Actually Sees
AI Assistant (After Guardrails):
“Great choice training for a marathon! Here are my top recommendations:
- Nike Pegasus 40 – $129.99 – Great cushioning for long runs
- Brooks Ghost 15 – $149.99 – Excellent for beginners
Both are well-suited for marathon training within your budget.”
What Evals Measure (Async)
Meanwhile, this interaction gets sampled for evaluation. An eval job runs asynchronously, taking about 2-3 seconds using LLM-as-judge to score the full interaction.
The eval results get logged. Over time, the team sees patterns: “Hallucination rate is 12% this week—up from 8% last week. Something changed. We need to investigate the prompt.”
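Here's a sketch of that judge call and the trend it feeds. call_llm() stands in for whatever model client you use, and the rubric is illustrative:

```python
JUDGE_PROMPT = """You are grading an AI shopping assistant.
Catalog facts: {facts}
Assistant's raw response: {response}
Did the response mention only real products, correct prices, and true
inventory levels? Reply with a single score from 0.0 to 1.0."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError   # stand-in for your model API client

def judge_faithfulness(trace: dict) -> float:
    prompt = JUDGE_PROMPT.format(facts=trace["catalog_facts"],
                                 response=trace["raw_response"])
    return float(call_llm(prompt))   # the raw shoe response would score low

def hallucination_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Fraction of sampled traces failing the faithfulness bar: the number
    that trends from 8% to 12% and triggers an investigation."""
    return sum(s < threshold for s in scores) / len(scores)
```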
The difference: Guardrails prevented the customer from seeing wrong information. Evals measured that there’s a quality problem to fix at the source.
The Bottom Line
Guardrails are your seatbelt—they prevent the crash.
Evals are your crash test rating—they tell you if you’re safe to drive.
You need both.
Remember where we started? The customer service agent that refunded $847 for a delivered order. The coding assistant that shipped a password to production.
With guardrails, those incidents never would have happened. The refund would have been blocked before execution. The code would have been sanitized before the developer even saw it.
With evals, you would have known the problems were brewing. You’d have seen “23% of refund attempts are for delivered orders” in your dashboard before the first one slipped through. You’d have known your coding agent had a 15% rate of generating credentials in code.
The teams that build AI agents without both of these systems aren’t being bold—they’re being reckless. And in 2025, when agents can execute code, process payments, and take actions that affect real people and real money, reckless isn’t an option.
Build the guardrails. Run the evals. Trust your agents—but verify.
