Skip to content

Chapter 9: Data Strategy — Flywheels, Moats, and Ethics — Resources

Curated resources for deeper exploration of topics covered in this chapter.

Frameworks from This Chapter

  • Data Flywheel — The 5-component flywheel (Collection, Storage, Analysis, Application, Feedback) and how to build self-reinforcing data loops that compound advantage.
  • Data Moats — The moat test framework for evaluating whether your data advantage is measured in days, months, or years of defensibility.
  • 6 Data Strategy Mistakes — Six patterns that kill data flywheels, from building before product-market fit to platform risk and single-point dependencies.

Tools & Platforms

  • PostgreSQL — Recommended starting database for AI-first companies; Notion manages 200 billion blocks on sharded Postgres; OpenAI runs PostgreSQL as the backbone for ChatGPT (referenced in Section 4: Polyglot Persistence)
  • Redis — Cache layer for speed-critical data like session state and rate limiting (referenced in Section 4: Polyglot Persistence)
  • Vector Databases — Usage grew 377% in 2024 across enterprises per Databricks; stores embeddings for semantic search (referenced in Section 4: Polyglot Persistence)
  • LexisNexis — Harvey's exclusive partnership for primary law databases and Shepard's Citations that competitors cannot obtain (referenced in Sections 1 and 3)
  • NVIDIA Data Flywheel Blueprint — Achieved inference cost reductions up to 98.6% while maintaining comparable accuracy (referenced in Section 5: Privacy by Design)
  • Hugging Face HuggingChat — Demonstrates consent-by-design approach; conversations remain private and are never used for training (referenced in Section 5: Privacy by Design)
  • Apple Differential Privacy — Local differential privacy adding noise on-device before transmission; used for Genmoji improvements and QuickType suggestions (referenced in Section 5: Privacy by Design)
  • Duality Technologies — Federated learning with embedded Privacy Enhancing Technologies including secure aggregation and configurable privacy-accuracy balances (referenced in Section 5: Privacy by Design)
  • Mistral AI — All services run exclusively within the EU; refuses to train on customer data without explicit consent; open-source models enable data sovereignty (referenced in Section 5: Privacy by Design)

Further Reading

  • Harvey's Cold Start Strategy — How Harvey went from zero proprietary legal data in mid-2022 to $8B valuation and $100M ARR by designing a flywheel architecture before having data
  • IBM Watson Health Failure — Over $4B invested, sold for approximately $1B; more data without a flywheel to compound it
  • Klarna's AI Playbook — 48% revenue growth 2022-2024 while reducing operating expenses by 20%; AI contributed $40M in profitability gains in 2024, but customer service AI eventually plateaued
  • Spotify Data Architecture — 1.4 trillion events daily from 678 million users; multi-task training improves podcast and music discovery; AI users show 40% higher retention and 140 min/day usage vs 99 min
  • Duolingo Birdbrain System — AI estimates probability of correct answers in real-time; drove 59% DAU growth (21M to 34M users), 80%+ organic acquisition
  • Tesla FSD Data Collection — 2M+ vehicles capture edge case clips automatically; earns $7K per vehicle as data collector vs Waymo's $150K per vehicle
  • Cursor Growth Trajectory — Grew from $1M revenue in 2023 to $100M in 2024; Tab completion model achieved 28% higher accept rates with 21% fewer suggestions
  • Stitch Fix Recovery — Hybrid AI-human model resulted in 40% increase in average order value, 40% more repeat purchases, 30% fewer returns

Research & Data

  • AWS CDO Survey — 93% of CDOs say data strategy matters for GenAI, yet 57% haven't made necessary changes
  • Appen State of AI Report 2024 — Data accuracy in the U.S. declined from 63.5% in 2021 to 26.6% in 2024; 48% identify data management as most significant obstacle
  • MIT NANDA Study — 95% of enterprise AI pilots fail to reach production with measurable value
  • Nature: Model Collapse Research — AI trained on AI-generated content degrades over time; models exhibit "narrower range of output over time" when trained recursively
  • DLA Piper GDPR Survey — GDPR fines reached EUR 1.2 billion in 2024, up 38% from the previous year
  • CNIL Guidance 2025 — French regulator clarified that AI training on personal data from public sources can use legitimate interest but requires documentation before training begins
  • EU AI Act — High-risk system compliance deadline: August 2, 2026; fines up to EUR 35 million or 7% of global revenue
  • AT&T/NVIDIA Case Study — Fine-tuned models achieved 94% accuracy vs 78% for generic GPT-4 by activating existing customer service data
  • Perplexity Growth Data — 312M queries (May 2024) to 1.4B (June 2025); DAU/MAU ratio of 53% far exceeds benchmarks
  • AI Startup Failure Analysis — 92% failure rate within 18 months; 43% built products nobody wanted; 60-70% of AI wrappers generate zero revenue

Community & Learning

  • GDPR Compliance Resources — Key requirements for AI systems include legitimate processing basis, data protection impact assessments, transparency, and deletion rights
  • CCPA/CPRA (California) — Expands "sharing" to include behavioral advertising; penalties reach $7,988 per intentional violation
  • Eight New U.S. State Privacy Laws (2025) — Delaware, Iowa, Nebraska, New Hampshire, New Jersey, Tennessee, Minnesota, and Maryland
  • Hugging Face Model Cards — Standardized documentation covering training datasets, performance metrics, known biases, and intended uses; serve as "boundary objects" accessible across disciplines

Companies Referenced in This Chapter

Harvey, IBM (Watson Health), Spotify, Duolingo, Klarna, Tesla, Cursor, AT&T, Perplexity, Notion, Discord, OpenAI, Glean, fileAI, Stitch Fix, Mistral AI, Hugging Face, Apple, NVIDIA, Duality Technologies, DuckDuckGo, Yirifi, Waymo, Ghost Autonomy