PDF Generation
Context: The book needs two PDF outputs: an internal draft for author review and a clean reader version for publication. generate_pdf.py handles both, with Mermaid diagram rendering and persistent caching.
Generating a PDF from an Obsidian vault sounds simple until you actually try it. Wiki-links need resolving. YAML frontmatter needs stripping. Mermaid diagrams need rendering to SVG. Internal comment blocks need hiding in reader mode but showing in internal mode. And the whole thing needs to look like a book, not a web page.
Two Modes, Two Audiences
| Mode | Flag | Who It's For | What It Shows |
|---|---|---|---|
| Internal | --mode internal (default) | Author, reviewers | Everything: research source blocks, internal notes, draft metadata |
| Reader | --mode reader | Publication, distribution | Clean text only: no <!-- INTERNAL: ... --> blocks, formatted citations, professional layout |
The internal mode is your working copy. It includes the <!-- INTERNAL: Research Sources --> blocks that track which research files informed each section. When you're reviewing chapter 8 and wondering "did I use the McKinsey study here?", the internal PDF answers that without switching to Obsidian.
The reader mode strips all of that. What remains is the book as your audience will see it.
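The difference between the two modes can be sketched as a single filter function. This is a minimal illustration, not the generator's actual code: the function name `apply_mode` and both regular expressions are assumptions, and it assumes each internal block is one HTML comment that opens with `<!-- INTERNAL:` and ends at the next `-->`. How internal mode makes those blocks visible in the PDF is not covered here.

```python
import re

# Assumed patterns; the real generator may match these blocks differently.
INTERNAL_BLOCK = re.compile(r"<!--\s*INTERNAL:.*?-->", re.DOTALL)
OBSIDIAN_COMMENT = re.compile(r"%%.*?%%", re.DOTALL)


def apply_mode(markdown_text: str, mode: str = "internal") -> str:
    """Return the text untouched for internal mode, stripped for reader mode."""
    if mode == "internal":
        return markdown_text
    if mode != "reader":
        raise ValueError(f"unknown mode: {mode!r}")
    text = INTERNAL_BLOCK.sub("", markdown_text)
    text = OBSIDIAN_COMMENT.sub("", text)
    # Collapse the runs of blank lines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Keeping the internal path as a no-op means the working copy never loses information; only the reader path is destructive.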
How the Pipeline Works
The PDF generator is a multi-module package (pdf_generator/) with 11 files handling discovery, parsing, rendering, Mermaid caching, table of contents, and styling. The pipeline:
- File discovery. Reads all markdown files from the draft folder in order: part intros, then chapter folders, then sections within each chapter. The ordering is deterministic -- files sort by their numeric prefixes (01-, 02-, etc.).
- Frontmatter stripping. Every file starts with YAML frontmatter (--- delimited). The parser removes it. In reader mode, it also removes Obsidian comments (%%...%%) and HTML comment blocks (<!-- INTERNAL: ... -->).
- Link resolution. Wiki-links like [[concepts/Data Flywheel|Data Flywheel]] become plain text in the PDF. The display text is preserved; the link syntax is stripped.
- Mermaid rendering. Diagrams in ```mermaid code blocks are rendered to SVG, then embedded inline. This uses mmdc (Mermaid CLI) locally when available, with an API fallback. Rendering happens in parallel -- 4 workers by default.
- Markdown to HTML. The Python markdown library converts the processed markdown to HTML.
- HTML to PDF. WeasyPrint renders HTML to PDF with CSS-based styling. This gives full control over typography, margins, page breaks, and layout -- more flexible than Pandoc for book-specific formatting.
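The first three steps are plain text processing and can be sketched in a few functions. Everything named here (`discover`, `numeric_prefix_key`, `strip_frontmatter`, `resolve_wiki_links`) is hypothetical, and two behaviours are assumptions rather than facts from the generator: files without a numeric prefix sort last, and a wiki-link without display text falls back to the last path segment of its target.

```python
import re
from pathlib import Path
from typing import List

FRONTMATTER = re.compile(r"\A---[ \t]*\n.*?\n---[ \t]*\n", re.DOTALL)
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def numeric_prefix_key(path: Path) -> tuple:
    """Sort key built from the leading number of each path component."""
    key = []
    for part in path.parts:
        match = re.match(r"(\d+)", part)
        # Assumption: components without a numeric prefix sort last.
        key.append((int(match.group(1)) if match else float("inf"), part))
    return tuple(key)


def discover(draft_dir: Path) -> List[Path]:
    """Return all markdown files under draft_dir in numeric-prefix order."""
    files = draft_dir.rglob("*.md")
    return sorted(files, key=lambda p: numeric_prefix_key(p.relative_to(draft_dir)))


def strip_frontmatter(text: str) -> str:
    """Remove a leading ----delimited YAML block, if present."""
    return FRONTMATTER.sub("", text, count=1)


def resolve_wiki_links(text: str) -> str:
    """Replace [[target|display]] with display text only."""
    def replace(match: re.Match) -> str:
        target, display = match.group(1), match.group(2)
        # Assumption: with no display text, use the last segment of the target.
        return display if display else target.split("/")[-1]
    return WIKI_LINK.sub(replace, text)
```

Sorting on the parsed integer rather than the raw string keeps `10-` after `02-` even if a prefix is ever written without zero padding.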
Mermaid Caching
This is the performance-critical piece. The book has dozens of Mermaid diagrams -- architecture flows, decision trees, sequence diagrams. Rendering each one through mmdc takes 2-5 seconds. Without caching, a full book PDF takes 10+ minutes.
The caching system (pdf_generator/cache.py) hashes the diagram source text and stores the rendered SVG in output/cache/. On subsequent runs, only modified diagrams re-render. A full book PDF with warm cache takes under 60 seconds.
# Show what's cached
python scripts/generate_pdf.py --cache-stats
# Force fresh rendering
python scripts/generate_pdf.py --clear-cache
# Disable caching entirely (slow, useful for debugging)
python scripts/generate_pdf.py --no-cache
# More parallel workers for faster rendering
python scripts/generate_pdf.py --parallel 8
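The idea behind the cache, hashing the diagram source and reusing the stored SVG, can be sketched as follows. This is an illustration only, not the contents of pdf_generator/cache.py: the function names, the SHA-256 choice, and the one-file-per-hash layout are assumptions, and `render` stands in for whatever function actually produces the SVG.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def cache_path(source: str, cache_dir: Path) -> Path:
    """Map diagram source text to a cache file named after its hash."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.svg"


def render_cached(source: str, cache_dir: Path, render: Callable[[str], str]) -> str:
    """Return the cached SVG if present; otherwise render and store it."""
    path = cache_path(source, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    svg = render(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(svg, encoding="utf-8")
    return svg


def render_all(sources: List[str], cache_dir: Path,
               render: Callable[[str], str], workers: int = 4) -> List[str]:
    """Render diagrams in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: render_cached(s, cache_dir, render), sources))
```

Because the key is derived from the source text alone, editing a diagram changes its hash and forces a re-render, while untouched diagrams hit the cache. In this sketch, two identical diagrams rendered at the same moment may both miss the cache and render twice; the result is still correct, only slower.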
Commands
# Default: internal mode, Draft 3
python scripts/generate_pdf.py
# Reader-ready PDF from specific draft
python scripts/generate_pdf.py --draft "Draft 3" --mode reader
# Custom output filename
python scripts/generate_pdf.py --output my_book.pdf
# Include removed/archived content (excluded by default)
python scripts/generate_pdf.py --include-removed
The output lands in output/ with a timestamped filename: Building_AI_First_Companies_Draft_3_reader_20250205_1430.pdf. Timestamps prevent overwriting previous versions -- useful when comparing drafts.
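For orientation, the flags above and the timestamped filename could be wired together roughly like this. It is a sketch under assumptions: the real script's argument parsing is not shown in this document, so `build_parser` and `default_filename` are hypothetical names, and the defaults are inferred from the text (internal mode, Draft 3, 4 workers).

```python
import argparse
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags documented on this page."""
    parser = argparse.ArgumentParser(prog="generate_pdf.py")
    parser.add_argument("--mode", choices=["internal", "reader"], default="internal")
    parser.add_argument("--draft", default="Draft 3")
    parser.add_argument("--output", default=None, help="custom output filename")
    parser.add_argument("--include-removed", action="store_true")
    parser.add_argument("--parallel", type=int, default=4)
    parser.add_argument("--no-cache", action="store_true")
    parser.add_argument("--clear-cache", action="store_true")
    parser.add_argument("--cache-stats", action="store_true")
    return parser


def default_filename(draft: str, mode: str, now: datetime,
                     title: str = "Building_AI_First_Companies") -> str:
    """Build a name like <title>_<draft>_<mode>_<YYYYMMDD_HHMM>.pdf."""
    stamp = now.strftime("%Y%m%d_%H%M")
    return f"{title}_{draft.replace(' ', '_')}_{mode}_{stamp}.pdf"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.output or default_filename(args.draft, args.mode, datetime.now()))
```

Passing the timestamp in as an argument, instead of calling `datetime.now()` inside the function, keeps the naming logic easy to test.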
Why WeasyPrint
We tried Pandoc first. It works for straightforward documents, but book layout demands CSS-level control. WeasyPrint renders HTML with CSS, which means:
- Page breaks. CSS break-before: page on chapter headers. No manual page break markers in the markdown.
- Typography. Font families, sizes, line heights, and margins controlled in one stylesheet. Change the body font across the entire book in one line.
- Headers/footers. Running headers with chapter titles, page numbers, consistent formatting.
- Print optimization. Widows, orphans, column balancing -- CSS properties that Pandoc's intermediate LaTeX can handle but with more friction.
The trade-off: WeasyPrint requires pango on macOS (brew install pango) and the weasyprint Python package. The dependency is heavier than Pandoc. Worth it for the layout control.
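To make the CSS points concrete, here is a small sketch of a book stylesheet passed to WeasyPrint. The stylesheet values (A5 page, margins, fonts, the `chapter-title` string name) and the function `markdown_to_pdf` are illustrative assumptions, not the project's actual stylesheet or code; the library calls used are `markdown.markdown` and WeasyPrint's `HTML(...).write_pdf(...)` with a `CSS(string=...)` stylesheet.

```python
BOOK_CSS = """
@page {
    size: A5;                       /* assumed trim size */
    margin: 20mm 18mm;
    @top-center { content: string(chapter-title); }   /* running header */
    @bottom-center { content: counter(page); }        /* page number */
}
body { font-family: serif; font-size: 10.5pt; line-height: 1.45; }
h1 {
    break-before: page;             /* every chapter starts on a new page */
    string-set: chapter-title content();
}
p { widows: 2; orphans: 2; }
"""


def markdown_to_pdf(markdown_text: str, output_path: str, css: str = BOOK_CSS) -> None:
    """Convert processed markdown to HTML, then render it to PDF."""
    # Imported lazily so this module loads even where the packages are missing.
    import markdown
    from weasyprint import CSS, HTML

    body = markdown.markdown(markdown_text, extensions=["tables", "fenced_code"])
    html = f"<html><body>{body}</body></html>"
    HTML(string=html).write_pdf(output_path, stylesheets=[CSS(string=css)])
```

Changing the body font for the whole book is then a one-line edit to the `body` rule.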
Dependencies
pip install markdown weasyprint pyyaml
brew install pango # macOS only
npm install -g @mermaid-js/mermaid-cli # Optional, for local Mermaid rendering
Without mermaid-cli installed, the generator falls back to an API-based renderer. Slower, but works without Node.js.
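The local-first, API-fallback choice could look something like the sketch below. The function names are hypothetical, and because this page does not say which API the fallback uses, it is represented by a caller-supplied `api_render` function. The `mmdc -i <input> -o <output>` invocation is the Mermaid CLI's standard input/output form.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Optional


def mmdc_available() -> bool:
    """True when the Mermaid CLI is on PATH."""
    return shutil.which("mmdc") is not None


def render_with_mmdc(source: str) -> str:
    """Render one diagram to SVG by shelling out to mmdc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "diagram.mmd"
        out = Path(tmp) / "diagram.svg"
        src.write_text(source, encoding="utf-8")
        subprocess.run(["mmdc", "-i", str(src), "-o", str(out)],
                       check=True, capture_output=True)
        return out.read_text(encoding="utf-8")


def render_diagram(source: str, api_render: Callable[[str], str],
                   use_local: Optional[bool] = None) -> str:
    """Prefer local mmdc; otherwise call the (unspecified) API renderer."""
    if use_local is None:
        use_local = mmdc_available()  # auto-detect unless the caller decides
    return render_with_mmdc(source) if use_local else api_render(source)
```

The `use_local` override exists so the fallback path can be exercised on a machine that does have mmdc installed.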
Related: Script Ecosystem