# Observability Example
Companion code for Chapter 4: Infrastructure for AI-First Operations --- The Infrastructure Stack
AI-specific metrics collection, dashboard definitions, and alert rules. Tracks the four metric categories the book identifies as non-optional from day one: latency, tokens, cost, and quality.
## What This Demonstrates
Traditional APM tools track HTTP latency and error rates. AI workloads need different signals:
- Latency --- Per-model response times with p50/p95/p99 percentile breakdowns. A slow model isn't a 500 error, but it kills user experience.
- Tokens --- Input and output token counts per model; without these you cannot explain cost or optimize prompts.
- Cost --- Per-request and aggregate AI spend, broken down by model, feature, and team. The number your CFO cares about.
- Quality --- Optional quality scores for AI responses. The only signal that tells you whether the AI is actually working.
Without early tracking, AI costs grow faster than usage and you miss optimization opportunities that compound over time.
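As a rough sketch of what capturing these four signals per request can look like (the class and field names below are illustrative, not the actual metrics.py API):

```python
# Illustrative sketch only -- the real collector lives in metrics.py and its
# API may differ. Shown here to make the four signal categories concrete.
from dataclasses import dataclass, field
import time


@dataclass
class RequestMetrics:
    model: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    quality_score: float | None = None                   # optional quality signal
    tags: dict[str, str] = field(default_factory=dict)   # feature/team attribution


class AIMetricsCollector:
    """In-memory collector; production code would export these instead."""

    def __init__(self) -> None:
        self.records: list[RequestMetrics] = []

    def record(self, metrics: RequestMetrics) -> None:
        self.records.append(metrics)


collector = AIMetricsCollector()
start = time.perf_counter()
# ... call your model here ...
collector.record(RequestMetrics(
    model="gpt-4o-mini",
    latency_ms=(time.perf_counter() - start) * 1000,
    input_tokens=820,
    output_tokens=310,
    cost_usd=0.0004,
    quality_score=0.86,
    tags={"feature": "support-chat", "team": "platform"},
))
```

Keeping cost and tags on every record is what makes the per-model, per-feature, and per-team breakdowns later in this README possible.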
## File Structure
```
observability/
├── metrics.py          # AI metrics collector (latency, tokens, cost, quality)
├── alerts.py           # Alert rules and threshold evaluation
├── dashboards/
│   └── ai-ops.json     # Dashboard definition for AI operations
└── requirements.txt
```
## Quick Start
```bash
# No external dependencies required
python metrics.py   # Run the metrics demo
python alerts.py    # Run the alerts demo
```
## Metrics Demo
`metrics.py` simulates 50 requests across two models and shows the following (a sketch of the aggregation appears after the list):
- Cost breakdown --- Total spend, per-model cost, cost by feature tag
- Latency summary --- p50/p95/p99 per model
- Tag-level cost attribution --- Which features and teams are driving spend
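As a rough illustration of how that summary could be computed from in-memory records (nearest-rank percentiles and a simple tag rollup; the real metrics.py output may differ):

```python
# Sketch of percentile + cost aggregation over in-memory records; assumes
# records shaped like the RequestMetrics sketch above.
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    idx = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]


def summarize(records) -> None:
    latencies = defaultdict(list)     # model -> [latency_ms, ...]
    cost_by_tag = defaultdict(float)  # (tag key, tag value) -> cost

    for r in records:
        latencies[r.model].append(r.latency_ms)
        for key, value in r.tags.items():
            cost_by_tag[(key, value)] += r.cost_usd

    for model, values in latencies.items():
        print(model,
              f"p50={percentile(values, 50):.0f}ms",
              f"p95={percentile(values, 95):.0f}ms",
              f"p99={percentile(values, 99):.0f}ms")
    for (key, value), cost in sorted(cost_by_tag.items()):
        print(f"{key}={value}: ${cost:.4f}")
```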
## Alert Rules
`alerts.py` includes default alert rules for the most common AI failure modes (an evaluation sketch follows the table):
| Alert | Severity | Threshold | What It Catches |
|---|---|---|---|
| Daily budget warning | WARNING | > $50/day | Unexpected cost growth |
| Daily budget critical | CRITICAL | > $100/day | Budget breach |
| Cost per request spike | WARNING | > $0.10/req | Runaway agent or prompt regression |
| P95 latency warning | WARNING | > 5,000ms | Model or provider slowdown |
| P99 latency critical | CRITICAL | > 10,000ms | Severe performance degradation |
| Quality degradation | WARNING | < 0.7 score | Model drift or prompt issues |
| Quality critical drop | CRITICAL | < 0.5 score | Possible model failure |
| High token usage | WARNING | > 1M tokens | Token consumption spike |
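Each rule reduces to a named threshold plus a comparison. The shape below is an illustrative sketch, not the exact alerts.py structure:

```python
# Illustrative threshold evaluation; alerts.py's actual rule format may differ.
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    severity: str                              # "WARNING" or "CRITICAL"
    metric: str                                # which aggregate this rule watches
    threshold: float
    breaches: Callable[[float, float], bool]   # (value, threshold) -> bool


DEFAULT_RULES = [
    AlertRule("Daily budget warning", "WARNING", "daily_cost_usd", 50.0,
              lambda value, limit: value > limit),
    AlertRule("P95 latency warning", "WARNING", "latency_p95_ms", 5_000.0,
              lambda value, limit: value > limit),
    AlertRule("Quality degradation", "WARNING", "avg_quality_score", 0.7,
              lambda value, limit: value < limit),
]


def evaluate(rules: list[AlertRule], metrics: dict[str, float]) -> list[str]:
    """Return a human-readable line for every rule whose threshold is breached."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and rule.breaches(value, rule.threshold):
            fired.append(f"[{rule.severity}] {rule.name}: {value} vs {rule.threshold}")
    return fired
```

Severity is kept as data rather than behavior so that routing (Slack vs. PagerDuty, for example) can be decided by whatever consumes the fired alerts.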
## Dashboard Definition
`dashboards/ai-ops.json` defines a complete AI operations dashboard with five rows:
- Cost Overview --- Total spend, cost per request, breakdowns by model and feature
- Latency --- p50/p95/p99 time series, latency distribution histogram
- Token Usage --- Input/output by model, tokens per request, input/output ratio
- Quality --- Quality scores over time, quality by feature
- Operational Health --- Request volume, error rate, provider status, active alerts
Import this into Grafana, Datadog, or your preferred dashboard tool. Adjust metric names to match your exporter format.
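If your exporter uses different metric names, a text-level rewrite of the JSON is usually enough. The helper below is a hypothetical convenience; the names in RENAMES are examples, not the names actually used in ai-ops.json:

```python
# Hypothetical helper for adapting dashboards/ai-ops.json to your exporter's
# metric naming. It treats the file as text, so it works whether names appear
# as JSON values or inside query strings. RENAMES entries are examples only.
from pathlib import Path

RENAMES = {
    "ai_cost_usd_total": "llm_cost_usd_total",
    "ai_request_latency_seconds": "llm_request_latency_seconds",
}

text = Path("dashboards/ai-ops.json").read_text()
for old, new in RENAMES.items():
    text = text.replace(old, new)
Path("dashboards/ai-ops.custom.json").write_text(text)
```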
## Production Notes
This example uses in-memory storage. In production:
- Export metrics via the Prometheus client, OpenTelemetry, or the Datadog SDK (see the sketch after this list)
- Use Helicone or Langfuse for LLM-specific observability (built-in caching can cut costs 20-30%)
- Send alerts to PagerDuty, Slack, or your incident management platform
- Store metric history in a time-series database for trend analysis
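For the first bullet, a minimal Prometheus export might look like the sketch below. The prometheus_client calls are real, but the metric names, labels, and buckets are illustrative choices:

```python
# Sketch of exporting the same four signals with prometheus_client
# (pip install prometheus-client). Metric names and labels are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("ai_requests_total", "AI requests", ["model", "feature"])
COST = Counter("ai_cost_usd_total", "Cumulative AI spend in USD", ["model", "feature"])
TOKENS = Counter("ai_tokens_total", "Token usage", ["model", "direction"])
LATENCY = Histogram(
    "ai_request_latency_seconds", "AI request latency", ["model"],
    buckets=(0.5, 1, 2, 5, 10, 30),
)


def export_request(model: str, feature: str, latency_s: float,
                   input_tokens: int, output_tokens: int, cost_usd: float) -> None:
    """Record one AI request; Prometheus scrapes the aggregates."""
    REQUESTS.labels(model=model, feature=feature).inc()
    COST.labels(model=model, feature=feature).inc(cost_usd)
    TOKENS.labels(model=model, direction="input").inc(input_tokens)
    TOKENS.labels(model=model, direction="output").inc(output_tokens)
    LATENCY.labels(model=model).observe(latency_s)


# Expose /metrics once at service startup; your app's main loop keeps the
# process alive so Prometheus can scrape it.
start_http_server(9100)
```

At query time, `histogram_quantile()` over the latency histogram recovers the p95/p99 views the dashboard expects.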
## Related
- AI Gateway Example --- The gateway that generates these metrics
- Unified Auth Example --- Auth events feed into observability
- Infrastructure Audit Checklist
- 5 Infrastructure Mistakes