Chapter 9: Data Strategy — Flywheels, Moats, and Ethics — Resources

Curated resources for deeper exploration of topics covered in this chapter.

Frameworks from This Chapter

Data Flywheel — The 5-component flywheel (Collection, Storage, Analysis, Application, Feedback) and how to build self-reinforcing data loops that compound advantage.
Data Moats — The moat test framework for evaluating whether your data advantage is measured in days, months, or years of defensibility.
6 Data Strategy Mistakes — Six patterns that kill data flywheels, from building before product-market fit to platform risk and single-point dependencies.

Tools & Platforms

PostgreSQL — Recommended starting database for AI-first companies; Notion manages 200 billion blocks on sharded Postgres; OpenAI runs PostgreSQL as the backbone for ChatGPT (referenced in Section 4: Polyglot Persistence)
Redis — Cache layer for speed-critical data like session state and rate limiting (referenced in Section 4: Polyglot Persistence)
Vector Databases — Store embeddings for semantic search; enterprise usage grew 377% in 2024, per Databricks (referenced in Section 4: Polyglot Persistence)
LexisNexis — Harvey's exclusive partnership for primary law databases and Shepard's Citations that competitors cannot obtain (referenced in Sections 1 and 3)
NVIDIA Data Flywheel Blueprint — Achieved inference cost reductions of up to 98.6% while maintaining comparable accuracy (referenced in Section 5: Privacy by Design)
Hugging Face HuggingChat — Demonstrates consent-by-design approach; conversations remain private and are never used for training (referenced in Section 5: Privacy by Design)
Apple Differential Privacy — Local differential privacy adding noise on-device before transmission; used for Genmoji improvements and QuickType suggestions (referenced in Section 5: Privacy by Design)
Duality Technologies — Federated learning with embedded Privacy Enhancing Technologies including secure aggregation and configurable privacy-accuracy balances (referenced in Section 5: Privacy by Design)
Mistral AI — All services run exclusively within the EU; refuses to train on customer data without explicit consent; open-source models enable data sovereignty (referenced in Section 5: Privacy by Design)
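
To make the local differential privacy idea in the Apple entry above concrete, here is a minimal sketch of randomized response, a textbook local-privacy technique. It is not Apple's production mechanism, and the names `randomize_bit`, `estimate_rate`, and the `epsilon` value are assumptions chosen for illustration. Each device perturbs its own yes/no value before sending it, and the server can only recover an aggregate rate, not any individual answer.

```python
import math
import random


def randomize_bit(true_bit: bool, epsilon: float, rng: random.Random) -> bool:
    """Run on-device: report the true bit with probability e^eps / (e^eps + 1),
    otherwise report the flipped bit. (Hypothetical helper, illustration only.)"""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_truth else not true_bit


def estimate_rate(reports: list[bool], epsilon: float) -> float:
    """Run on the server: debias the noisy reports to estimate the true
    fraction of users whose bit is True."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # observed = (1 - p) + true_rate * (2p - 1), solved for true_rate
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


# Simulated population: 30% of users have the sensitive attribute.
rng = random.Random(0)
true_bits = [i < 30_000 for i in range(100_000)]
reports = [randomize_bit(b, epsilon=1.0, rng=rng) for b in true_bits]
print(f"estimated rate: {estimate_rate(reports, epsilon=1.0):.3f}")
```

A smaller `epsilon` flips more reports, which strengthens individual privacy but makes the aggregate estimate noisier. That is the same privacy-accuracy tradeoff the Duality Technologies entry describes as configurable.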

Research & Data

AWS CDO Survey — 93% of CDOs say data strategy matters for GenAI, yet 57% haven't made necessary changes
Appen State of AI Report 2024 — Data accuracy in the U.S. declined from 63.5% in 2021 to 26.6% in 2024; 48% of respondents identify data management as the most significant obstacle
MIT NANDA Study — 95% of enterprise AI pilots fail to reach production with measurable value
Nature: Model Collapse Research — AI trained on AI-generated content degrades over time; models exhibit "narrower range of output over time" when trained recursively
DLA Piper GDPR Survey — GDPR fines reached EUR 1.2 billion in 2024, up 38% from the previous year
CNIL Guidance 2025 — French regulator clarified that AI training on personal data from public sources can use legitimate interest but requires documentation before training begins
EU AI Act — High-risk system compliance deadline: August 2, 2026; fines of up to EUR 35 million or 7% of global annual turnover, whichever is higher, for the most serious violations
AT&T/NVIDIA Case Study — Fine-tuned models achieved 94% accuracy vs 78% for generic GPT-4 by activating existing customer service data
Perplexity Growth Data — Queries grew from 312M (May 2024) to 1.4B (June 2025); DAU/MAU ratio of 53% far exceeds benchmarks
AI Startup Failure Analysis — 92% failure rate within 18 months; 43% built products nobody wanted; 60-70% of AI wrappers generate zero revenue

Community & Learning

GDPR Compliance Resources — Key requirements for AI systems include legitimate processing basis, data protection impact assessments, transparency, and deletion rights
CCPA/CPRA (California) — Expands "sharing" to include behavioral advertising; penalties reach $7,988 per intentional violation
Eight New U.S. State Privacy Laws (2025) — Delaware, Iowa, Nebraska, New Hampshire, New Jersey, Tennessee, Minnesota, and Maryland
Hugging Face Model Cards — Standardized documentation covering training datasets, performance metrics, known biases, and intended uses; serve as "boundary objects" accessible across disciplines

Companies Referenced in This Chapter

Harvey, IBM (Watson Health), Spotify, Duolingo, Klarna, Tesla, Cursor, AT&T, Perplexity, Notion, Discord, OpenAI, Glean, fileAI, Stitch Fix, Mistral AI, Hugging Face, Apple, NVIDIA, Duality Technologies, DuckDuckGo, Yirifi, Waymo, Ghost Autonomy