Structured Outputs & Evaluation

Master quality control for production agents through structured output validation and comprehensive evaluation frameworks

Part 1: Structured Outputs
From Free Text to Machine-Readable Data
🎯 Why Structure Matters
LLMs naturally generate unstructured text. But production systems need predictable, typed, validated outputs that can trigger downstream actions, integrate with APIs, populate databases, and flow through software pipelines. Structured outputs transform agents from conversational tools into reliable system components.
❌ Unstructured Problem
Agent says "Create a high-priority ticket for John about the login bug." How does your system parse this? What if it says "urgent" instead of "high"?
✅ Structured Solution
Agent outputs: {"assignee": "john@company.com", "priority": "high", "category": "bug", "title": "Login issue"}. Perfect for API calls!
❌ Unstructured Problem
"The meeting should probably be next Tuesday around 2pm with Sarah and Mike." Ambiguous, unparseable.
✅ Structured Solution
{"date": "2025-10-28", "time": "14:00", "duration_minutes": 60, "attendees": ["sarah@co", "mike@co"]}. Calendar API ready!
🔧 Pydantic: Type-Safe Output Validation
Simple Schema
Basic type validation with required fields
from pydantic import BaseModel

class TicketOutput(BaseModel):
    title: str
    priority: str  # high, medium, low
    assignee: str
    estimated_hours: int

# Validates types automatically
ticket = TicketOutput(**llm_output)
Advanced Validation
Constraints, enums, and custom validators
from pydantic import BaseModel, field_validator
from typing import List
from enum import Enum

class Priority(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class Ticket(BaseModel):
    title: str
    priority: Priority
    tags: List[str]
    hours: int

    @field_validator('hours')
    @classmethod
    def check_hours(cls, v):
        if v < 0 or v > 100:
            raise ValueError('hours must be between 0 and 100')
        return v
Nested Structures
Complex hierarchical data models
from typing import List
from pydantic import BaseModel

class Action(BaseModel):
    action_type: str
    owner: str
    deadline: str

class MeetingSummary(BaseModel):
    title: str
    date: str
    attendees: List[str]
    action_items: List[Action]
    key_decisions: List[str]

# Validates entire nested structure
summary = MeetingSummary(**llm_output)
Error Handling
Graceful validation failure management
from pydantic import ValidationError

try:
    ticket = TicketOutput(**llm_output)
    # Success: use validated ticket
    create_ticket_in_system(ticket)
    
except ValidationError as e:
    # Validation failed
    print(f"Errors: {e.errors()}")
    # Ask LLM to try again with error details
    retry_with_feedback(e.errors())
🔄 Structured Output Generation Flow
1
📋
Define Schema
Create a Pydantic model with the required fields, types, and constraints
2
🤖
Provide to LLM
Send the schema as a JSON format instruction via function calling or the system prompt
3
✅
Validate Output
Parse the LLM response through the Pydantic model to check types and constraints
4
🔁
Retry if Failed
If validation fails, provide the error details to the LLM and request a corrected output (see the sketch below)
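The loop below is a minimal sketch of this four-step flow. It assumes Pydantic v2 (for model_json_schema()) and an OpenAI-style chat client, and reuses the TicketOutput model from above; the model name and helper names (generate_ticket, SYSTEM_PROMPT) are placeholders, not part of any specific framework.

import json
from pydantic import ValidationError

SYSTEM_PROMPT = (
    "Extract a support ticket from the user's message. "
    "Respond with JSON only, matching this schema: "
    + json.dumps(TicketOutput.model_json_schema())
)

def generate_ticket(client, user_message: str, max_retries: int = 3) -> TicketOutput:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder: any model that follows JSON instructions
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = response.choices[0].message.content
        try:
            # Step 3: validate the raw output against the schema
            return TicketOutput(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Step 4: feed the errors back and request a corrected output
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your JSON was invalid: {err}. Return corrected JSON only.",
            })
    raise RuntimeError("Could not produce a valid ticket after retries")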
Part 2: Agent Evaluation
Measuring Performance & Driving Improvement
📦
Final Response Evaluation
(Black Box Testing)
Examine only the end result without looking at internal steps. Like judging a restaurant by the final dish: you don't care how it was made, just whether it's good (a code sketch follows below).
βœ… Best For
  • End-to-end testing
  • User experience validation
  • Quality benchmarking
  • A/B testing different agents
❌ Limitations
  • Can't debug failures
  • No visibility into reasoning
  • Hard to identify specific issues
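A minimal sketch of black-box evaluation, assuming a run_agent(question) -> str entry point and a handful of hand-written reference cases; the keyword check is deliberately simple and is often replaced by an LLM judge (covered later in this section).

test_cases = [
    {"input": "Reset my password",
     "must_contain": ["reset link", "email"]},
    {"input": "What's your refund policy?",
     "must_contain": ["30 days"]},
]

def evaluate_final_responses(run_agent, cases) -> float:
    passed = 0
    for case in cases:
        answer = run_agent(case["input"]).lower()
        # Black box: look only at the final answer, not the steps taken
        if all(phrase in answer for phrase in case["must_contain"]):
            passed += 1
    return passed / len(cases)

# e.g. accuracy = evaluate_final_responses(run_agent, test_cases)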
🎯
Single-Step Evaluation
(Unit Testing)
Test individual decisions in isolation: did the agent choose the right tool? Did it format arguments correctly? Like testing each ingredient separately (a code sketch follows below).
βœ… Best For
  • Tool selection accuracy
  • Argument validation
  • Fast iteration during development
  • Targeted debugging
❌ Limitations
  • Doesn't validate full task completion
  • Misses multi-step coordination issues
  • Can pass tests but fail in practice
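A sketch of single-step checks written as plain unit tests. It assumes the agent exposes an intermediate choose_tool(message) step that returns the tool name and arguments it would call; that interface, the agent object, and the tool names are illustrative, and TicketOutput is reused from Part 1.

def test_chooses_calendar_tool_for_scheduling():
    tool_call = agent.choose_tool("Book a meeting with Sarah next Tuesday at 2pm")
    assert tool_call.name == "create_calendar_event"

def test_ticket_arguments_are_well_formed():
    tool_call = agent.choose_tool("Urgent: login page is broken, assign it to John")
    assert tool_call.name == "create_ticket"
    # Reuse the production Pydantic schema to validate the generated arguments
    ticket = TicketOutput(**tool_call.arguments)
    assert ticket.priority == "high"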
🛤️
Trajectory Evaluation
(Flight Recorder)
Trace the complete execution path: every tool call, argument, result, and decision. Like a flight recorder capturing the entire journey, not just takeoff and landing (a code sketch follows below).
βœ… Best For
  • Deep debugging
  • Understanding agent reasoning
  • Comparing execution paths
  • Root cause analysis
❌ Limitations
  • Most complex to implement
  • Expensive (requires full logging)
  • Slower to run
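A sketch of trajectory scoring, assuming every step of a run is logged as a (tool, arguments) record; it compares the observed tool sequence against an expected reference path and reports how closely they match.

from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    arguments: dict

def trajectory_match(observed: list[Step], expected_tools: list[str]) -> dict:
    observed_tools = [step.tool for step in observed]
    return {
        "exact_match": observed_tools == expected_tools,
        # In-order match: every expected tool appears, extra steps are tolerated
        "in_order": _is_subsequence(expected_tools, observed_tools),
        "extra_steps": max(0, len(observed_tools) - len(expected_tools)),
    }

def _is_subsequence(needle: list[str], haystack: list[str]) -> bool:
    it = iter(haystack)
    return all(tool in it for tool in needle)

# e.g. trajectory_match(logged_steps, ["search_kb", "create_ticket", "send_email"])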
πŸ“ Four Critical Evaluation Dimensions
🎯
1. Task Completion
  • Did the agent accomplish what the user asked?
  • Was the goal fully achieved or only partially?
  • Are there any unfinished sub-tasks?
  • Did the agent recognize when it couldn't complete something?
✨
2. Quality Control
  • Is the output in the correct format?
  • Did the agent follow instructions properly?
  • Was context used appropriately (not ignored or misapplied)?
  • Are there factual errors or hallucinations?
🔧
3. Tool Interaction
  • Did the agent choose the right tools for the task?
  • Were tool arguments valid and well-formed?
  • Did tool calls return useful results?
  • Was the tool sequence logical and efficient?
⚡
4. System Metrics
  • How fast did the agent respond (latency)?
  • How many tokens were consumed (cost)?
  • What's the failure/error rate?
  • Did the agent hit context window limits?
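One way to make these four dimensions concrete is to log them together for every evaluated run; the record below is an illustrative structure, not a standard schema.

from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    # 1. Task completion
    task_completed: bool
    unfinished_subtasks: list[str] = field(default_factory=list)
    # 2. Quality control
    format_valid: bool = True
    quality_score: float = 0.0       # e.g. 1-5 from an LLM judge
    # 3. Tool interaction
    correct_tool_rate: float = 0.0   # fraction of steps that used the right tool
    invalid_argument_count: int = 0
    # 4. System metrics
    latency_seconds: float = 0.0
    total_tokens: int = 0
    error: str | None = None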
🤖 LLM-as-Judge: Automated Evaluation
How It Works
  • Use a separate LLM to evaluate agent outputs
  • Provide rubric, reference answer, actual output
  • LLM scores on dimensions (1-5 or pass/fail)
  • Can evaluate subjective qualities like tone, helpfulness
  • Much faster and cheaper than human evaluation
Best Practices
  • Use stronger model as judge (e.g., GPT-4 judging GPT-3.5)
  • Provide clear scoring criteria and examples
  • Combine with rule-based checks for objectivity
  • Validate judge accuracy with human spot-checks
  • Store judgments for analysis and improvement
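A minimal LLM-as-judge sketch, again assuming an OpenAI-style client; the rubric wording, model name, and 1-5 scale are placeholders to adapt to your task.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Rubric: 5 = fully correct and helpful, 3 = partially correct, 1 = wrong or unhelpful.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}
Respond with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(client, question: str, reference: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # use a stronger model than the one being judged
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer
            ),
        }],
        response_format={"type": "json_object"},
    )
    # Parse and return {"score": ..., "reason": ...}; store it for later analysis
    return json.loads(response.choices[0].message.content)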
🔄 The Continuous Improvement Loop
1
📊
Measure
Run evaluation across all dimensions using test cases and real traffic
2
🔍
Analyze
Identify patterns in failures, bottlenecks, and opportunities for improvement
3
🛠️
Improve
Fix tool descriptions, prompts, validation rules, or add missing capabilities
4
✅
Validate
Re-run evaluation to confirm improvements, then deploy and repeat (a minimal harness sketch follows below)
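The measure and validate steps can be wired into a tiny regression harness; the sketch below assumes the evaluate_final_responses helper from the black-box sketch earlier and two agent variants to compare.

def regression_check(run_agent_before, run_agent_after, cases, min_gain: float = 0.0) -> bool:
    baseline = evaluate_final_responses(run_agent_before, cases)
    candidate = evaluate_final_responses(run_agent_after, cases)
    print(f"baseline={baseline:.2%}  candidate={candidate:.2%}")
    # Only ship the change if it does not regress on the test suite
    return candidate >= baseline + min_gain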
💡 Key Insights from Evaluation
  • Unclear tool descriptions → Agent chooses wrong tools or uses them incorrectly
  • Excessive token usage → Need summarization, better memory management, or shorter prompts
  • Inefficient tool sequences → Agent needs better planning or more direct paths to goals
  • Low quality scores → Improve prompts, add examples, or use stronger base model
  • High failure rates → Add retry logic, better error handling, or fallback strategies
  • Context window overflow → Implement sliding window or summarization for memory