Guardrails & Evaluation: Safety You Can See
Content filtering and quality assessment – two sides of the same trust coin.
AI safety isn’t just about blocking bad outputs. It’s about measuring quality systematically and knowing exactly when and why interventions happen. When guardrails fire, you should understand the trigger. When evaluations fail, you should see the scores.
First: Why Both Guardrails AND Evals?
A common question when building enterprise AI systems: Do I need guardrails, evals, or both? The answer is both, but they serve fundamentally different purposes and operate on completely different timescales.
The Core Distinction: Guardrails are real-time prevention. Evals are async quality measurement. One is a bouncer at the door. The other is a quality auditor reviewing recordings.
| Dimension | Guardrails | Evals |
|---|---|---|
| Timing | Real-time, every request | Async, sampled or batched |
| Latency Requirement | <100ms (in critical path) | Seconds to minutes (acceptable) |
| Output Type | Binary: ALLOW / WARN / BLOCK | Continuous: 0.0 to 1.0 scores |
| Purpose | Prevent immediate harm | Measure quality patterns over time |
| When to Use | Every action (synchronous) | Sampled or periodic (async) |
| Examples | PII blocking, toxicity rejection | Hallucination rate, faithfulness score |
Why you need both: Guardrails catch obvious problems immediately (a user sending SSNs to a cloud model), but they can’t measure subtle quality dimensions like “is this response faithful to the source documents?” That requires evals. Conversely, evals are too slow and expensive to run on every request, so they can’t provide real-time protection.
For a deeper exploration of this distinction and when to apply each approach, see Evals or Guardrails or Both?
The Problem with “We Have Guardrails”
Every AI vendor says they have guardrails. Content moderation. Safety filters. PII protection. But what does that actually mean?
When you send a prompt to Claude or GPT-4, content filtering happens before your request reaches the model. If something gets blocked, you get a generic refusal. Why? What triggered it? Was it too conservative? You’ll never know. It’s a black box.
“Your request couldn’t be completed due to our content policy.”
— Every AI API, without explaining which policy or why
Part 1: Guardrails – The Safety Net
A comprehensive guardrails system handles multiple types of checks:
Data Classification
Classifies content as PUBLIC, INTERNAL, CONFIDENTIAL, or SECRET based on detected entities
PII Detection
Finds names, emails, SSNs, credit cards, phone numbers – with option to redact
Toxicity Detection
ML-based analysis for toxicity, threats, insults, and identity attacks
Bias Detection
Identifies potentially biased language or stereotypes
Prompt Injection
Detects attempts to manipulate the model through crafted inputs
Privacy Routing
Recommends local vs cloud processing based on data sensitivity
The Guardrail Pipeline
Guardrails check both input and output:
PII Detection: Two Approaches
PII detection can work at two levels of sophistication:
1. Regex-Based (Fast, Default)
Fast pattern matching for common PII formats:
# Regex patterns for common PII
SSN_PATTERN = r'\b\d{3}-\d{2}-\d{4}\b'
EMAIL_PATTERN = r'\b[\w.-]+@[\w.-]+\.\w+\b'
PHONE_PATTERN = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
CREDIT_CARD = r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'
Pros: Fast, no dependencies, no ML overhead
Cons: Misses context-dependent PII (names, addresses)
2. Presidio-Based (Accurate, Optional)
Microsoft’s Presidio uses NLP to understand context:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
results = analyzer.analyze(
text="Contact John Smith at john.smith@acme.com",
language="en"
)
# Results:
# - PERSON: "John Smith" (confidence: 0.85)
# - EMAIL: "john.smith@acme.com" (confidence: 0.95)
Pros: Catches names, addresses, contextual PII
Cons: Slower, requires ML models (~500MB)
“Please contact John Smith at john.smith@acme.com or 555-123-4567 regarding SSN 123-45-6789”
Output (redacted):
“Please contact [PERSON] at [EMAIL] or [PHONE] regarding SSN [SSN]”
Trust Insight: Seeing exactly what was redacted lets you verify the guardrail is working correctly. Is it catching real PII or false positives? Is it missing any patterns you care about?
Toxicity Detection: ML vs Keywords
Similar story for toxicity – two approaches:
Keyword Matching (Default)
Simple but catches obvious cases:
TOXIC_KEYWORDS = ["hate", "kill", "attack", ...]
def check_toxicity(text):
for keyword in TOXIC_KEYWORDS:
if keyword in text.lower():
return True
return False
Detoxify ML (Optional)
BERT-based model trained on the Jigsaw toxic comments dataset:
from detoxify import Detoxify
model = Detoxify('original')
results = model.predict("I hate this product")
# Results (0-1 scores):
# {
# 'toxicity': 0.72,
# 'severe_toxicity': 0.01,
# 'obscene': 0.02,
# 'threat': 0.01,
# 'insult': 0.45,
# 'identity_attack': 0.01
# }
The ML approach understands nuance. “I hate this product” is different from “I hate you.” The model gives scores across 6 categories, so you can tune sensitivity per use case.
| Category | What It Detects | Example |
|---|---|---|
| Toxicity | General toxic content | “This is terrible” |
| Severe Toxicity | Extreme toxic content | Explicit hate speech |
| Obscene | Vulgar language | Profanity |
| Threat | Threatening language | “I’ll make you pay” |
| Insult | Personal attacks | “You’re an idiot” |
| Identity Attack | Attacks on identity groups | Slurs, stereotypes |
Data Classification
Every input gets classified into one of four levels:
| Level | Triggers | Routing Recommendation |
|---|---|---|
| PUBLIC | No sensitive content detected | Any model (cloud or local) |
| INTERNAL | Company names, project references | Trusted providers |
| CONFIDENTIAL | PII detected (names, emails, etc.) | Local processing recommended |
| SECRET | SSN, financial data, health records | Local processing required |
The classification doesn’t block requests – it provides a recommendation to the orchestration service. If you’re testing with synthetic data, you can override. If it’s real user data, the recommendation ensures it stays on local models.
Prompt Injection Detection
Malicious inputs try to override system instructions. Detection looks for patterns like:
# Injection patterns to detect
injection_patterns = [
"ignore previous instructions",
"disregard your training",
"you are now",
"act as if",
"jailbreak",
"pretend you are",
"forget everything"
]
# Confidence score indicates how likely the injection attempt
Why False Positives Matter
Here’s something you learn fast when running guardrails: false positives are more annoying than false negatives in a research context.
A false positive means legitimate content gets flagged. If you’re writing a medical chatbot prototype and every mention of “symptoms” triggers the toxicity filter, that’s useless.
The Tuning Challenge: Setting the right thresholds is an art. Too strict = constant false positives. Too loose = things slip through. 0.5 is a common default toxicity threshold, but for domain-specific applications, adjust per-category.
Feature Flags for Flexibility
All the “heavy” ML features can be optional and disabled by default:
# Default: Fast, regex-based checks
config = {
"presidio_enabled": False, # Use regex PII
"detoxify_enabled": False, # Use keyword toxicity
"bias_enabled": False # Skip bias detection
}
# Full ML: Slower but more accurate
config = {
"presidio_enabled": True, # ML-based PII
"detoxify_enabled": True, # BERT toxicity
"bias_enabled": True # Bias detection
}
This lets you trade off speed vs accuracy depending on the use case. For quick iterations, regex is fine. For production validation, enable everything.
Part 2: Evaluation – Measuring Quality
While guardrails catch obvious problems, evaluation measures subtle quality dimensions. This is essential for continuous improvement.
The Evaluation Framework Stack
Multiple frameworks exist because different metrics matter for different use cases:
| Framework | Strength | Key Metrics |
|---|---|---|
| DeepEval | Comprehensive LLM testing | Hallucination, Relevancy, Faithfulness, Bias, Toxicity |
| G-Eval | LLM-as-Judge approach | Coherence, Fluency, Relevance (custom criteria) |
| Ragas | RAG-specific evaluation | Context Precision, Context Recall, Answer Relevancy |
What Evaluation Scores Look Like
LLM-as-Judge: Using AI to Evaluate AI
The most powerful evaluation approach uses one LLM to judge another. This enables evaluation of subjective qualities like coherence and helpfulness:
# G-Eval scoring prompt structure
evaluation_prompt = """
You are evaluating an AI response for {metric}.
Query: {query}
Response: {response}
Context: {context}
Score from 1-10 where:
1-3: Poor - Response fails to address the query
4-6: Acceptable - Response partially addresses the query
7-9: Good - Response fully addresses the query
10: Excellent - Response exceeds expectations
Provide your score and reasoning.
"""
# The judge model provides both score and explanation
The Power: LLM-as-Judge can evaluate arbitrary criteria. Define what “good” means for your use case, and the system measures it. This is impossible with rule-based evaluation.
RAG-Specific Evaluation
When using retrieved context, additional metrics matter:
- Context Precision: How much of the retrieved context was actually relevant?
- Context Recall: Did we retrieve everything we needed?
- Faithfulness: Does the response stay true to the retrieved content?
- Answer Relevancy: Does the answer actually address the question?
# RAG evaluation checks both retrieval and generation
rag_result = {
"context_precision": 0.85, # 85% of retrieved content was useful
"context_recall": 0.78, # Found 78% of relevant content
"faithfulness": 0.92, # Response closely follows context
"answer_relevancy": 0.88 # Answer addresses the question
}
The Evaluation Pipeline
A complete evaluation system orchestrates multiple frameworks:
# Evaluation pipeline configuration
config = {
"frameworks": ["deepeval", "geval", "ragas"],
"metrics": ["relevance", "coherence", "helpfulness"],
"global_threshold": 0.7,
"parallel_execution": True,
"enable_rag_metrics": True,
"include_recommendations": True
}
Automated Recommendations
Beyond scores, evaluation systems can provide actionable guidance:
‘faithfulness’ metric has low pass rate (58%).
Consider reviewing prompt engineering or model configuration.
‘coherence’ shows high variance (range: 0.45).
Consider investigating inconsistent response patterns.
Putting It Together: The Complete Safety Picture
When guardrails and evaluation work together:
- Input guardrails sanitize and check incoming requests
- Routing selects the appropriate model (covered in Part 1)
- Context retrieval provides grounding (covered in Part 2)
- Output guardrails check the response for safety
- Evaluation scores quality metrics
- Logging captures everything for audit (covered in Part 5)
Trust Insight: This layered approach means you can trace any response back through the pipeline. Why was this blocked? What was the faithfulness score? Which guardrails fired? Everything is visible.
The Actual Implementation
The guardrails and evaluation described here aren’t hypothetical—they’re running services in the proof-of-concept. The implementation includes two dedicated microservices with ~1,500 lines of working code.
guardrails-service (Port 8001)
Here’s the actual service initialization:
# Actual implementation: Feature-flagged guardrails
class GuardrailsAIService:
"""
Primary guardrails service using keyword-based + PII detection
Features:
- Data classification (PUBLIC/INTERNAL/CONFIDENTIAL/SECRET)
- Privacy-based routing recommendations
- PII detection and redaction
- Toxicity detection
- Prompt injection detection
"""
def __init__(self,
presidio_enabled: bool = False,
detoxify_enabled: bool = False,
bias_enabled: bool = False
):
self.classifier = DataClassifier()
# Feature-flagged ML detectors
self.pii_detector = AdvancedPIIDetector(enabled=presidio_enabled)
self.toxicity_detector = MLToxicityDetector(enabled=detoxify_enabled)
self.bias_detector = BiasDetector(enabled=bias_enabled)
eval-service (Port 8005)
The evaluation service implements a 1,020-line pipeline supporting G-Eval, RAG metrics, and context-aware threshold adjustment:
# Actual implementation: Context-aware evaluation
class ContextAwareEvaluator:
"""Adjusts thresholds based on domain and sensitivity."""
DOMAIN_ADJUSTMENTS = {
"GENERAL": 0.0,
"MEDICAL": 0.1, # Higher bar for medical
"LEGAL": 0.1,
"FINANCIAL": 0.05
}
SENSITIVITY_ADJUSTMENTS = {
"LOW": -0.05, # More permissive
"MEDIUM": 0.0,
"HIGH": 0.1 # Stricter thresholds
}
Key Architectural Decisions
- Feature-flagged ML: Presidio/Detoxify optional—falls back to regex/keywords
- Fail-open design: Guardrails failures never block responses
- Tenant-scoped policies: Different teams can have different thresholds
- Async evaluation: Quality scoring happens post-response
What You Learn Building This
- Guardrails are easier to build than to tune. The code is straightforward. Finding the right thresholds for your use case takes iteration.
- Context matters enormously. “I’m going to kill it at this presentation” shouldn’t trigger threat detection. Keyword matching doesn’t get this. ML does (mostly).
- Transparency builds trust. When you can see exactly why something was flagged, you trust the system more. Even when it’s wrong, you understand why.
- Fallbacks are essential. If Presidio isn’t available, fall back to regex. If Detoxify fails, fall back to keywords. Never let the guardrails crash the system.
- LLM-as-Judge is powerful. When a vendor says “we use content moderation,” you now know what questions to ask: Is it ML-based or pattern matching? What’s the false positive rate? Can I tune the thresholds?
Coming Up Next
Guardrails block harmful content and evaluation measures quality. But what about understanding why AI makes certain decisions? In the next post, we’ll explore Responsible AI – explainability techniques like LIME and SHAP that make AI reasoning visible.
Leave a comment