# Background Agent
Autonomous task processor with checkpointing, monitoring, and resource budgets.
Book reference: Chapter 6 - Agent Architecture, Sections 1 and 6
## What This Demonstrates
Background agents execute well-defined tasks without human supervision. No one is watching. No one is waiting. They are triggered by schedules, events, or API calls -- not human messages.
This example implements the six background agent design patterns from the book:
- Idempotency - Safe to run the same task twice (idempotency keys prevent duplicates)
- Checkpointing - Saves progress after each task; resumes from interruption (see the sketch after this list)
- Alerting - Dead man's switch, failure escalation, budget alarms
- Audit logging - Every task start, completion, and failure is logged
- Graceful degradation - Failed tasks are quarantined; processing continues
- Resource budgeting - Token and cost limits prevent runaway execution
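The first two patterns work together: a checkpoint records which idempotency keys have already completed, so a restarted run skips finished work instead of redoing it. Below is a minimal file-based sketch; names like `CHECKPOINT_PATH` and `Checkpoint` are illustrative, not the actual structures in `tasks.py`.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

CHECKPOINT_PATH = Path("checkpoint.json")  # hypothetical location, not the repo's actual path

@dataclass
class Checkpoint:
    # idempotency keys of tasks that have already completed
    completed_keys: set[str] = field(default_factory=set)

def load_checkpoint() -> Checkpoint:
    """Resume from the last saved checkpoint, or start fresh if none exists."""
    if CHECKPOINT_PATH.exists():
        data = json.loads(CHECKPOINT_PATH.read_text())
        return Checkpoint(completed_keys=set(data["completed_keys"]))
    return Checkpoint()

def save_checkpoint(cp: Checkpoint) -> None:
    """Persist progress after every task so a crash loses at most one task's work."""
    CHECKPOINT_PATH.write_text(json.dumps({"completed_keys": sorted(cp.completed_keys)}))

def already_done(idempotency_key: str, cp: Checkpoint) -> bool:
    """Idempotency check: a key seen in a previous run is skipped, not reprocessed."""
    return idempotency_key in cp.completed_keys
```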
## Files
| File | Purpose |
|---|---|
| `agent.py` | Main processing loop with retry logic and budget enforcement |
| `tasks.py` | Task definitions, queue management, and checkpointing |
| `monitor.py` | Monitoring, alerting, and the dead man's switch pattern |
| `config.py` | Configuration via environment variables |
Uses the shared provider library at `examples/shared/` for LLM access via OpenRouter.
## Quick Start
```bash
# Install dependencies
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env and add your OpenRouter API key (https://openrouter.ai/keys)

# Run
python agent.py
```
## How It Works
```
[Load checkpoint if exists]
|
v
[For each pending task:]
|
+-- Check budget --> [STOP if exhausted]
|
+-- Record heartbeat
|
+-- Process task via LLM (OpenRouter)
| |
| +-- Success --> [Mark completed, log result]
| |
| +-- Failure --> [Retry with backoff]
| |
| +-- Max retries --> [Quarantine task, alert]
|
+-- Save checkpoint
|
v
[Report summary]
```
The agent processes a queue of tasks sequentially. After each task (success or failure), it saves a checkpoint to disk. If the process crashes, restarting picks up from the last checkpoint. Tasks that exceed the retry limit are quarantined rather than blocking the entire queue.
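The control flow above compresses into a short sketch. This is not the repo's `agent.py`; `llm_call` and `alert` are stand-ins, and the heartbeat and checkpoint calls are shown as comments.

```python
import asyncio
import random

MAX_RETRIES = 3
TOKEN_BUDGET = 50_000

async def llm_call(task: str) -> tuple[str, int]:
    """Placeholder for the real OpenRouter call; returns (result, tokens used)."""
    await asyncio.sleep(0.05)
    return f"result for {task!r}", random.randint(200, 800)

def alert(message: str) -> None:
    """Placeholder for monitor.py alerting (Slack, PagerDuty, ...)."""
    print(f"ALERT: {message}")

async def run(tasks: list[str], completed: set[str]) -> None:
    tokens_used = 0
    for task in tasks:
        if task in completed:               # idempotency: skip work finished in a prior run
            continue
        if tokens_used >= TOKEN_BUDGET:     # resource budgeting: stop before costs spiral
            alert("token budget exhausted")
            break
        # record_heartbeat() would go here to feed the dead man's switch
        for attempt in range(MAX_RETRIES):
            try:
                result, tokens = await llm_call(task)
                tokens_used += tokens
                completed.add(task)         # mark completed (audit log entry in the real agent)
                break
            except Exception:
                await asyncio.sleep(2 ** attempt)   # back off before the next attempt
        else:
            alert(f"quarantined {task!r} after {MAX_RETRIES} failed attempts")
        # save_checkpoint(completed) would persist progress here, success or failure

asyncio.run(run(["summarize report A", "summarize report B"], completed=set()))
```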
## Key Design Decisions
- Provider abstraction via the `shared/` library. Swap providers by changing `LLM_PROVIDER` in `.env`.
- File-based checkpointing for simplicity. Production systems should use Temporal, Redis, or a database.
- Idempotency keys on each task prevent duplicate processing across restarts.
- Token and cost budgets stop the agent before costs spiral. From the book: "This isn't optional -- it's the difference between a manageable mistake and a resignation letter."
- Dead man's switch monitors heartbeats (sketched below). If the agent goes silent, the monitor fires a critical alert.
- Temperature 0.0 for deterministic outputs on batch tasks.
- Async architecture using `asyncio` for non-blocking LLM calls.
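The dead man's switch is worth spelling out because it inverts the usual alerting logic: the alarm fires when nothing happens. A minimal sketch, assuming heartbeats are written to a timestamp file; the real `monitor.py` may store them differently.

```python
import time
from pathlib import Path

HEARTBEAT_FILE = Path("heartbeat.txt")   # hypothetical location
MAX_SILENCE_SECONDS = 120                # fire if the agent is quiet longer than this

def record_heartbeat() -> None:
    """Called by the agent between tasks to prove it is still alive."""
    HEARTBEAT_FILE.write_text(str(time.time()))

def check_dead_mans_switch() -> None:
    """Run periodically by the monitor; a stale or missing heartbeat is a critical alert."""
    if not HEARTBEAT_FILE.exists():
        print("CRITICAL: no heartbeat recorded yet")
        return
    silence = time.time() - float(HEARTBEAT_FILE.read_text())
    if silence > MAX_SILENCE_SECONDS:
        print(f"CRITICAL: agent silent for {silence:.0f}s")
```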
## Configuration
| Variable | Default | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | (required) | OpenRouter API key |
| `LLM_PROVIDER` | `openrouter` | Provider name (`openrouter` or `openai`) |
| `MODEL` | `google/gemini-2.5-flash` | Model to use |
| `BG_AGENT_MAX_RETRIES` | `3` | Max retries per task |
| `BG_AGENT_TIMEOUT` | `300` | Task timeout in seconds |
| `BG_AGENT_POLL_INTERVAL` | `10` | Polling interval in seconds |
| `BG_AGENT_TOKEN_BUDGET` | `50000` | Max tokens per run |
| `BG_AGENT_COST_BUDGET` | `1.00` | Max cost (USD) per run |
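Loading these is plain environment-variable parsing. An illustrative version with defaults matching the table; the real `config.py` may structure this differently.

```python
import os

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]          # required; raises KeyError if missing
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openrouter")
MODEL = os.getenv("MODEL", "google/gemini-2.5-flash")
MAX_RETRIES = int(os.getenv("BG_AGENT_MAX_RETRIES", "3"))
TIMEOUT_SECONDS = int(os.getenv("BG_AGENT_TIMEOUT", "300"))
POLL_INTERVAL_SECONDS = int(os.getenv("BG_AGENT_POLL_INTERVAL", "10"))
TOKEN_BUDGET = int(os.getenv("BG_AGENT_TOKEN_BUDGET", "50000"))
COST_BUDGET_USD = float(os.getenv("BG_AGENT_COST_BUDGET", "1.00"))
```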
## Sample Tasks
The demo includes five sample tasks across three types (a minimal task shape is sketched after the list):
- `summarize` - Condense text into one sentence
- `analyze_sentiment` - Classify sentiment with confidence score
- `extract_entities` - Pull named entities from text
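A task in this style typically carries its type, its payload, and a stable idempotency key. The field names below are illustrative, not the exact schema in `tasks.py`.

```python
from dataclasses import dataclass

@dataclass
class Task:
    idempotency_key: str   # stable ID so restarted runs skip already-completed work
    task_type: str         # "summarize" | "analyze_sentiment" | "extract_entities"
    payload: str           # the text to process

sample = Task(
    idempotency_key="demo-summarize-001",
    task_type="summarize",
    payload="Background agents run unattended, so they must checkpoint and alert.",
)
```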
## Extending This Example
- Replace file-based queue with Redis, SQS, or Temporal workflows
- Add a scheduler (cron, APScheduler) for periodic execution
- Connect the monitor to Slack, PagerDuty, or a dashboard
- Add parallel task processing with `asyncio.gather()`
- Implement exponential backoff between retries (both sketched below)
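The last two bullets fit together naturally: bounded concurrency via `asyncio.gather()` plus exponential backoff with jitter. A sketch, with `process_task` standing in for the per-task logic shown earlier.

```python
import asyncio
import random

async def process_task(task: str) -> str:
    """Placeholder for the per-task LLM work."""
    await asyncio.sleep(0.1)
    return f"done: {task}"

async def with_retries(task: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return await process_task(task)
        except Exception:
            # exponential backoff with jitter: ~1s, ~2s, ~4s between attempts
            await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"task {task!r} exhausted retries")

async def run_parallel(tasks: list[str], max_concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)      # cap simultaneous LLM calls
    async def bounded(task: str) -> str:
        async with sem:
            return await with_retries(task)
    return await asyncio.gather(*(bounded(t) for t in tasks))

print(asyncio.run(run_parallel(["a", "b", "c"])))
```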