Guide: Setting Up Polyglot Persistence¶
A step-by-step guide to implementing a multi-database architecture where each workload runs on the database best suited for it --- without over-engineering from day one.
Based on Chapter 4: Infrastructure and Chapter 9: Data Strategy
What You'll Build¶
A polyglot persistence layer that matches each data type to the right database engine. By the end of this guide, you will have:
- A clear map of which data belongs in which database type
- A phased rollout plan that starts simple and adds complexity only when measured problems demand it
- Connection patterns that let your application talk to multiple databases cleanly
- A decision framework for when to add the next database to your stack
Prerequisites¶
- A running PostgreSQL instance (this is your starting point for everything)
- A production application with real user traffic (or an imminent launch)
- Basic understanding of your data access patterns --- what gets read most, what gets written most, what needs to be fast
- Familiarity with your compliance requirements (some database choices have regulatory implications)
Step 1: Audit Your Current Data Landscape¶
Before adding any database, you need to understand what you already have. Map every data type in your system against these five questions:
| Question | If Yes, Use | Example Data |
|---|---|---|
| Does this data require ACID transactions? | Relational (PostgreSQL) | User accounts, billing, audit logs |
| Does the schema change frequently? | Document store (MongoDB, DynamoDB) | Agent conversations, JSON API responses |
| Is read latency critical (< 10ms)? | Cache layer (Redis, Memcached) | Session state, rate limiting, hot keys |
| Do you need semantic or similarity search? | Vector database (Pinecone, Weaviate, Qdrant) | Embeddings for RAG, recommendation features |
| Is this analytics, not operations? | Data warehouse (BigQuery, Snowflake) | Reporting queries, dashboards, ML training |
Most teams discover that 80% or more of their data belongs in a relational database. That is fine. The point of polyglot persistence isn't to use every database type --- it is to add the right one at the right time.
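If it helps to make the audit tangible, the map can be encoded as a small artifact your team reviews alongside the architecture. The sketch below is illustrative only: the domain names, flags, and recommendation strings are placeholders, and the rule order simply mirrors the table above.

```python
# Minimal audit sketch: tag each data domain with the five questions from the
# table, then derive a recommendation. Domains and flags are placeholders.

AUDIT = {
    "user_accounts":  {"acid": True,  "fluid_schema": False, "hot_read": False, "semantic": False, "analytics": False},
    "session_state":  {"acid": False, "fluid_schema": False, "hot_read": True,  "semantic": False, "analytics": False},
    "agent_convos":   {"acid": False, "fluid_schema": True,  "hot_read": False, "semantic": False, "analytics": False},
    "doc_embeddings": {"acid": False, "fluid_schema": False, "hot_read": False, "semantic": True,  "analytics": False},
    "usage_reports":  {"acid": False, "fluid_schema": False, "hot_read": False, "semantic": False, "analytics": True},
}


def recommend(flags: dict) -> str:
    """First matching question wins, in the table's order; default is PostgreSQL."""
    if flags["acid"]:
        return "relational (PostgreSQL)"
    if flags["fluid_schema"]:
        return "document store (MongoDB, DynamoDB)"
    if flags["hot_read"]:
        return "cache layer (Redis) in front of PostgreSQL"
    if flags["semantic"]:
        return "vector database (or pgvector)"
    if flags["analytics"]:
        return "data warehouse (BigQuery, Snowflake)"
    return "relational (PostgreSQL)"


for domain, flags in AUDIT.items():
    print(f"{domain}: {recommend(flags)}")
```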
Step 2: Start with PostgreSQL for Everything¶
This isn't a compromise. This is the correct starting point.
Notion manages 200 billion block entities on sharded PostgreSQL. OpenAI runs PostgreSQL as the backbone for ChatGPT. When OpenAI needed to handle read-heavy workloads at scale, they didn't replace Postgres --- they optimized it with read replicas and PgBouncer, reducing latency from 50ms to under 5ms.
For your first six months (at minimum):
- Put all transactional data in PostgreSQL. User accounts, billing records, application state, configuration --- everything.
- Optimize before adding complexity. Proper indexing, connection pooling (PgBouncer or PgPool), read replicas for heavy read workloads.
- Measure actual bottlenecks. Do not guess. Use query analysis tools (EXPLAIN ANALYZE) to identify what is actually slow.
- Resist the urge to add a second database unless you have a measured problem that PostgreSQL can't solve with tuning.
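To make "measure, don't guess" concrete, here is a minimal sketch of running EXPLAIN (ANALYZE, BUFFERS) against queries you suspect are slow. It assumes psycopg2; the connection string, query names, and SQL are placeholders for your own.

```python
# Sketch: inspect suspected-slow queries with EXPLAIN (ANALYZE, BUFFERS)
# instead of guessing. DSN and queries are placeholders.
# Note: ANALYZE executes the statement, so point this at read queries
# (or a replica), not at writes.
import psycopg2

DSN = "dbname=app user=app host=localhost"  # placeholder connection string

SUSPECT_QUERIES = {
    "recent_orders": (
        "SELECT * FROM orders WHERE user_id = 42 "
        "ORDER BY created_at DESC LIMIT 20"
    ),
}

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        for name, sql in SUSPECT_QUERIES.items():
            cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + sql)
            plan = "\n".join(row[0] for row in cur.fetchall())
            print(f"=== {name} ===\n{plan}\n")
```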
Step 3: Add a Cache Layer When Latency Becomes Measurable¶
The first database you will likely add is Redis (or a comparable in-memory cache). The trigger: you have specific data that needs sub-10ms reads and your PostgreSQL queries can't deliver that consistently under load.
What goes in the cache:
- Session state and authentication tokens
- Rate limiting counters
- Frequently accessed configuration values
- Results of expensive queries that change infrequently
What doesn't go in the cache:
- Anything that requires durability guarantees
- Data that changes on every request (the cache invalidation overhead negates the benefit)
- Entire database tables "just in case"
Implementation pattern:
1. Identify the 3--5 hottest keys in your system (the data read most frequently)
2. Add Redis alongside PostgreSQL --- not as a replacement
3. Implement a read-through cache: check Redis first, fall back to PostgreSQL, populate Redis on miss
4. Set TTLs (time-to-live) that match your consistency requirements --- shorter for fast-changing data, longer for stable reference data
5. Monitor cache hit rates --- if they are below 80%, your cache strategy needs adjustment
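A minimal sketch of that read-through pattern, assuming redis-py and psycopg2; the table, column, and key names are illustrative rather than a prescribed schema:

```python
# Read-through cache: check Redis first, fall back to PostgreSQL, repopulate
# on a miss. Table, column, and key names are placeholders.
import json

import psycopg2
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
pg = psycopg2.connect("dbname=app user=app host=localhost")  # placeholder DSN
pg.autocommit = True  # read-only lookups; avoid holding transactions open

FLAGS_TTL_SECONDS = 300  # shorter for fast-changing data, longer for stable data


def get_feature_flags(org_id: int) -> dict:
    cache_key = f"flags:{org_id}"

    cached = r.get(cache_key)
    if cached is not None:                  # cache hit
        return json.loads(cached)

    with pg.cursor() as cur:                # cache miss: read from PostgreSQL
        cur.execute("SELECT flags FROM feature_flags WHERE org_id = %s", (org_id,))
        row = cur.fetchone()
    flags = row[0] if row else {}

    r.set(cache_key, json.dumps(flags), ex=FLAGS_TTL_SECONDS)  # repopulate with TTL
    return flags
```

For step 5, Redis already tracks keyspace_hits and keyspace_misses in its INFO stats, which is enough to compute the hit rate you are targeting.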
Step 4: Add a Document Store When Schema Flexibility Blocks Iteration¶
The trigger here is specific: you have a data type whose schema changes faster than your deployment cycle, and those schema migrations are slowing your team down.
Common candidates for document stores:
- Agent conversation histories --- variable structure depending on tools used, context windows, and interaction patterns
- JSON-heavy API responses --- especially when you are integrating with third-party systems whose schemas you don't control
- User-generated content with variable fields --- forms, surveys, or configuration objects where each record may have different attributes
Key decisions:
- MongoDB when you need flexible querying across document fields and your team has relational database experience (the learning curve is gentler)
- DynamoDB when you need predictable performance at any scale and your access patterns are well-defined (key-value or key-document lookups)
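If a document store does earn its place, a minimal pymongo sketch of storing and querying a variable-structure agent conversation looks like the following; the collection name and document shape are illustrative, not a prescribed model:

```python
# Sketch: variable-structure agent conversations in MongoDB via pymongo.
# URI, collection name, and document fields are placeholders.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
conversations = client["app"]["agent_conversations"]

# Each conversation can carry a different set of fields without a schema migration.
conversations.insert_one({
    "user_id": 42,
    "started_at": datetime.now(timezone.utc),
    "messages": [
        {"role": "user", "content": "Summarize my open invoices"},
        {"role": "assistant", "content": "...", "tool_calls": [{"name": "search_invoices"}]},
    ],
})

# Query across document fields, e.g. every conversation that invoked a given tool.
for convo in conversations.find({"messages.tool_calls.name": "search_invoices"}).limit(10):
    print(convo["_id"], convo["started_at"])
```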
Warning: Do not use a document store because "flexibility sounds good." Flexibility without structure becomes chaos. If you haven't thought through your data model, the answer is to think harder about the model, not to use a schemaless database.
Step 5: Add a Vector Database When AI Features Require Embeddings¶
This is usually the first genuinely new capability you add --- not a performance optimization, but a new kind of query that relational databases can't serve efficiently.
Vector database usage grew 377% in 2024 across enterprises. The trigger for adding one is straightforward: your product needs semantic search, similarity matching, or RAG (retrieval-augmented generation).
Choosing a vector database:
- Pinecone --- managed service, lowest operational overhead, sub-10ms latency at scale. Notion uses Pinecone to power semantic search across billions of embeddings.
- Weaviate --- open source, supports hybrid search (vector + keyword), good for teams that want more control
- Qdrant --- open source, Rust-based, strong performance characteristics for self-hosted deployments
- pgvector --- a PostgreSQL extension. If your embedding volumes are modest (under a few million vectors), this lets you avoid adding another database entirely. Try this first.
Implementation sequence:
1. Start with pgvector if your embedding count is under 2--3 million. This avoids adding infrastructure.
2. If query latency or scale exceeds pgvector's capabilities, migrate to a dedicated vector database.
3. Build your embedding pipeline: generate embeddings at write time, store them in the vector database, query at read time.
4. Implement hybrid search where relevant --- Perplexity AI uses Vespa to combine vector search, lexical search, and BM25 in a single system for their 400 million monthly queries.
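A minimal sketch of the "try pgvector first" path, assuming the pgvector extension is installed and using placeholder table names and dimensions; generating the embeddings themselves is out of scope here:

```python
# Sketch: embeddings in PostgreSQL via pgvector. Table name, dimension (1536),
# and connection string are placeholders; assumes the pgvector extension exists.
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")  # placeholder DSN

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_chunks (
            id bigserial PRIMARY KEY,
            content text NOT NULL,
            embedding vector(1536)
        )
    """)


def to_vector_literal(vec: list[float]) -> str:
    """Format a Python list as a pgvector literal, e.g. '[0.1,0.2,...]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"


def nearest_chunks(query_embedding: list[float], k: int = 5):
    """Return the k nearest chunks by L2 distance (pgvector's <-> operator)."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, content FROM doc_chunks "
            "ORDER BY embedding <-> %s::vector LIMIT %s",
            (to_vector_literal(query_embedding), k),
        )
        return cur.fetchall()
```

If the table grows into the millions of rows, an approximate index (pgvector supports IVFFlat and HNSW) keeps query latency predictable before you reach for a dedicated vector database.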
Step 6: Add a Data Warehouse When Analytics Queries Impact Production¶
The final trigger: your analytical queries (reports, dashboards, ML training jobs) are slowing down your operational database. This typically happens when you have enough data that aggregate queries scan millions of rows.
The separation principle: Operational workloads (fast reads and writes for your application) and analytical workloads (complex aggregations across large datasets) have fundamentally different performance profiles. Running them on the same database forces trade-offs that hurt both.
Notion discovered this firsthand. Their update-heavy workload (90% updates vs. inserts) made traditional data warehouses expensive. They built a data lake pipeline streaming Kafka CDC events through Apache Hudi into S3, achieving over a million dollars in annual savings while reducing ingestion time from days to minutes.
Implementation steps:
1. Set up change data capture (CDC) from your operational database to your analytics layer
2. Choose a destination: a managed data warehouse (BigQuery, Snowflake, Redshift) for SQL-friendly analytics, or a data lake (S3 + query engine) for more flexible processing
3. Establish a clear data freshness SLA --- how stale can analytics data be before it is a problem?
4. Route all reporting and ML training queries to the analytics layer, never to the production database
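As a sketch of step 3, a freshness check can be as simple as comparing the newest timestamp in the analytics copy of a table against the agreed SLA. The connection string, table, and column names below are placeholders; substitute your warehouse's client library as needed.

```python
# Sketch: data-freshness SLA check for the analytics layer. Connection details,
# table, and column names are placeholders; assumes updated_at is a timestamptz.
from datetime import datetime, timedelta, timezone

import psycopg2

FRESHNESS_SLA = timedelta(hours=4)  # agreed maximum staleness for dashboards

warehouse = psycopg2.connect(
    "dbname=analytics user=etl host=warehouse.internal"  # placeholder DSN
)


def check_freshness(table: str) -> None:
    with warehouse, warehouse.cursor() as cur:
        cur.execute(f"SELECT max(updated_at) FROM {table}")  # trusted table names only
        latest = cur.fetchone()[0]
    lag = datetime.now(timezone.utc) - latest
    if lag > FRESHNESS_SLA:
        print(f"ALERT: {table} lags production by {lag} (SLA {FRESHNESS_SLA})")
    else:
        print(f"OK: {table} lag {lag}")


check_freshness("orders")
```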
Key Decisions Summary¶
| Decision Point | Recommended Path | Anti-Pattern to Avoid |
|---|---|---|
| Starting architecture | PostgreSQL for everything | Spinning up 4+ databases before launch |
| When to add caching | Measured latency problems under load | "Pre-optimizing" with Redis on day one |
| When to add documents | Schema changes blocking deploys | Using document stores because "NoSQL scales" |
| When to add vectors | AI features requiring embeddings | Adding a vector DB without AI use cases |
| When to add analytics | Reporting queries hurting production | Running dashboards against production DB |
The goal isn't architectural elegance. The goal is a data architecture that accelerates your product without creating operational overhead that slows everything else down. Add complexity when you have measured problems, not imagined ones.
Related Resources¶
- Data Strategy Checklist --- Audit your data strategy across collection, quality, flywheel design, moats, and governance
- Building Data Flywheels --- How your persistence layer feeds self-reinforcing data loops
- Data Moats --- Assessing whether your data architecture creates defensible advantage
- 5 Infrastructure Mistakes --- Common infrastructure failures to avoid as you scale
Full chapters: Chapter 4: Infrastructure | Chapter 9: Data Strategy