Citation Auditor Template
Pseudocode and logic for building a citation audit script that measures citation density, finds uncited claims, and detects common footnote problems.
Purpose
Audit citation density across your manuscript and flag statistical claims that lack footnotes.
What to Detect
Citation References
Two patterns to match:
| Pattern | Location | Example |
|---|---|---|
| `[^tag-name]` | Body text (inline reference) | `reached $100M ARR[^harvey-arr]` |
| `[^tag-name]: ...` | References section (definition) | `[^harvey-arr]: Harvey 2024...` |
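The two patterns can be told apart with a pair of regular expressions. The following is a minimal sketch, not a definitive implementation: the names `INLINE_REF` and `DEFINITION` and the sample text are illustrative assumptions, and the negative lookahead `(?!:)` is one way to keep definition lines from being counted as inline references.

```python
import re

# Inline reference: [^tag] that is NOT followed by a colon.
INLINE_REF = re.compile(r"\[\^([\w-]+)\](?!:)")
# Definition: [^tag]: at the start of a line.
DEFINITION = re.compile(r"^\[\^([\w-]+)\]:", re.MULTILINE)

# Hypothetical sample built from the table's examples.
sample = (
    "Revenue reached $100M ARR[^harvey-arr].\n"
    "\n"
    "[^harvey-arr]: Harvey 2024...\n"
)

inline_tags = INLINE_REF.findall(sample)
defined_tags = DEFINITION.findall(sample)
print(inline_tags)   # ['harvey-arr']
print(defined_tags)  # ['harvey-arr']
```

Capturing only the tag name, without brackets or the trailing colon, makes the two lists directly comparable when looking for orphaned footnotes.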
Uncited Statistical Claims
Text matching any of these regex patterns should typically have a citation nearby:
- Percentages: `\d+(\.\d+)?%`
- Dollar amounts: `\$[\d.,]+\s*(million|billion|M|B|K)?`
- Large numbers: `\d{1,3}(,\d{3})+`
- Multipliers: `\d+x\s+(faster|slower|more|better|cheaper)`
- Ratios: `\d+\s+out of\s+\d+`
- Growth phrases: `doubled|tripled|grew by`
- Time specifics: `in\s+\d{4}|since\s+\d{4}`
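One way to apply these patterns is to join them into a single alternation and check a fixed character window around each match for a footnote marker. The sketch below assumes a 50-character radius and a hypothetical helper name `find_uncited`; the sample text is invented for illustration. Matching is case-sensitive, as in the patterns above.

```python
import re

# The listed patterns joined into one alternation. Groups are
# non-capturing so each match reports the whole matched text.
STAT_PATTERNS = re.compile(
    r"\d+(?:\.\d+)?%"
    r"|\$[\d.,]+\s*(?:million|billion|M|B|K)?"
    r"|\d{1,3}(?:,\d{3})+"
    r"|\d+x\s+(?:faster|slower|more|better|cheaper)"
    r"|\d+\s+out of\s+\d+"
    r"|doubled|tripled|grew by"
    r"|in\s+\d{4}|since\s+\d{4}"
)
CITATION = re.compile(r"\[\^[\w-]+\]")

def find_uncited(text, radius=50):
    """Return (line_number, matched_text) for stats with no footnote nearby."""
    uncited = []
    for m in STAT_PATTERNS.finditer(text):
        # Look `radius` characters to either side of the match.
        window = text[max(0, m.start() - radius): m.end() + radius]
        if not CITATION.search(window):
            line_no = text.count("\n", 0, m.start()) + 1
            uncited.append((line_no, m.group()))
    return uncited

sample = (
    "Revenue reached $100M ARR[^harvey-arr].\n"
    "The next sentence sits far enough away to fall outside the window.\n"
    "85% of enterprises report the same.\n"
)
print(find_uncited(sample))  # [(3, '85%')]
```

A character window is a rough proxy for "nearby". Checking the same sentence or paragraph instead is a reasonable alternative if the window produces false positives.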
Core Logic (Pseudocode)
    import re

    def audit_citations(section_path, words_per_citation):
        # read_file, extract_body, count_words, no_citation_within,
        # extract_urls_from_definitions, and find_duplicate_urls are helpers
        # you supply; STAT_PATTERNS is a compiled alternation of the regexes above.
        content = read_file(section_path)
        body_text = extract_body(content)  # exclude frontmatter, code blocks, definitions

        # Collect tags referenced in the body; (?!:) skips definition lines
        inline_refs = re.findall(r'\[\^([\w-]+)\](?!:)', body_text)
        # Collect tags defined in the references section
        definitions = re.findall(r'^\[\^([\w-]+)\]:', content, flags=re.MULTILINE)

        # Count words in the body only (excluding metadata)
        word_count = count_words(body_text)

        # Calculate density per 1,000 words; an empty section has density 0
        density = len(inline_refs) / (word_count / 1000) if word_count else 0.0

        # Find stats with no citation within 50 characters
        stats = list(STAT_PATTERNS.finditer(body_text))
        uncited = [s for s in stats if no_citation_within(s, body_text, radius=50)]

        # Find orphaned footnotes by comparing tag names in both directions
        orphaned_refs = sorted(set(inline_refs) - set(definitions))
        orphaned_defs = sorted(set(definitions) - set(inline_refs))

        # Find duplicate URLs
        urls = extract_urls_from_definitions(content)
        duplicates = find_duplicate_urls(urls)

        return {
            'citations': len(inline_refs),
            'words': word_count,
            'density_per_1k': density,
            # Target citation count for a section of this length
            'benchmark': word_count / words_per_citation,
            'uncited_claims': uncited,
            'orphaned_refs': orphaned_refs,
            'orphaned_defs': orphaned_defs,
            'duplicate_urls': duplicates,
        }
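The core logic leaves `extract_body` and `count_words` undefined. The following is one possible sketch of both, under stated assumptions: frontmatter is YAML delimited by `---` lines at the top of the file, code blocks use triple-backtick fences, and footnote definitions are single lines. Here `count_words` takes text rather than a file path, so that only the body is counted. The regex-based frontmatter stripping is a stand-in for a real YAML parser.

```python
import re

FENCE = "`" * 3  # triple backtick, built indirectly to keep this block readable

FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
FENCED_CODE = re.compile(
    rf"^{FENCE}.*?^{FENCE}[^\n]*\n?", re.DOTALL | re.MULTILINE
)
DEFINITION_LINE = re.compile(r"^\[\^[\w-]+\]:.*\n?", re.MULTILINE)

def extract_body(content):
    """Drop frontmatter, fenced code blocks, and footnote definition lines."""
    body = FRONTMATTER.sub("", content, count=1)
    body = FENCED_CODE.sub("", body)
    body = DEFINITION_LINE.sub("", body)
    return body

def count_words(text):
    """Count whitespace-separated tokens after removing footnote markers."""
    return len(re.sub(r"\[\^[\w-]+\]", "", text).split())

# Hypothetical section file contents.
content = (
    "---\ntitle: Demo\n---\n"
    "Revenue reached $100M ARR[^harvey-arr].\n"
    + FENCE + "python\nx = 1\n" + FENCE + "\n"
    "[^harvey-arr]: Harvey 2024...\n"
)

body = extract_body(content)
print(repr(body))         # 'Revenue reached $100M ARR[^harvey-arr].\n'
print(count_words(body))  # 4
```

Removing definition lines from the body keeps the references section from inflating the word count or being scanned for uncited statistics.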
Output Format
    Section 6.1: [Section Title]
      Citations: 8
      Words: 1,187
      Density: 6.7 per 1,000 words
      Benchmark: 7.9 citations (1 per 150 words)
      Status: MEETS BENCHMARK

      Uncited claims:
        Line 45: "85% of enterprises..." -- needs citation
        Line 78: "$4.2 billion market..." -- needs citation

      Orphaned references:
        [^missing-def] -- referenced but never defined

      Duplicate URLs:
        https://example.com/report -- used by [^tag-a] and [^tag-b]
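A report in this shape could be rendered from the result dictionary roughly as follows. This is a sketch under assumptions: the function name `format_report`, the shapes of `uncited_claims` (pairs of line number and text) and `duplicate_urls` (URL mapped to a list of tags), and the rule that a section passes when its citation count reaches the benchmark count are all illustrative choices, as is the `MEETS BENCHMARK` label.

```python
def format_report(title, result):
    """Render an audit result dict as a plain-text report."""
    # Assumption: a section passes when citations reach the target count.
    status = ("MEETS BENCHMARK" if result["citations"] >= result["benchmark"]
              else "BELOW BENCHMARK")
    lines = [
        title,
        f"  Citations: {result['citations']}",
        f"  Words: {result['words']:,}",
        f"  Density: {result['density_per_1k']:.1f} per 1,000 words",
        f"  Benchmark: {result['benchmark']:.1f} citations",
        f"  Status: {status}",
    ]
    if result["uncited_claims"]:
        lines.append("  Uncited claims:")
        lines += [f'    Line {n}: "{text}..." -- needs citation'
                  for n, text in result["uncited_claims"]]
    if result["orphaned_refs"]:
        lines.append("  Orphaned references:")
        lines += [f"    [^{tag}] -- referenced but never defined"
                  for tag in result["orphaned_refs"]]
    if result["duplicate_urls"]:
        lines.append("  Duplicate URLs:")
        lines += [f"    {url} -- used by " + " and ".join(f"[^{t}]" for t in tags)
                  for url, tags in result["duplicate_urls"].items()]
    return "\n".join(lines)

# Hypothetical result using the sample figures above.
result = {
    "citations": 8,
    "words": 1187,
    "density_per_1k": 8 / 1.187,
    "benchmark": 1187 / 150,
    "uncited_claims": [(45, "85% of enterprises")],
    "orphaned_refs": ["missing-def"],
    "orphaned_defs": [],
    "duplicate_urls": {"https://example.com/report": ["tag-a", "tag-b"]},
}
report = format_report("Section 6.1: [Section Title]", result)
print(report)
```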
Benchmark Targets
Fill in based on your citation density goals:
| Section Length | Target Citations | Density |
|---|---|---|
| 800 words | [N] | [N] per 1K |
| 1,200 words | [N] | [N] per 1K |
| 1,800 words | [N] | [N] per 1K |
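Once a words-per-citation goal is chosen, the table rows follow by arithmetic. The sketch below uses 150 words per citation purely as an example value, and rounding the target up to a whole citation is an assumption; `benchmark_row` is a hypothetical helper name.

```python
import math

def benchmark_row(section_words, words_per_citation):
    """Return (target_citations, density_per_1k) for one table row."""
    # Round up so the target is a whole number of citations.
    target = math.ceil(section_words / words_per_citation)
    density = round(1000 * target / section_words, 1)
    return target, density

for words in (800, 1200, 1800):
    print(words, benchmark_row(words, words_per_citation=150))
# 800 (6, 7.5)
# 1200 (8, 6.7)
# 1800 (12, 6.7)
```

Because the target is rounded up, shorter sections end up with a slightly higher density than the nominal 1,000 / 150 ≈ 6.7 per 1K.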
Key Features to Build
- Citation counting per section
- Density calculation against your benchmark
- Uncited statistical claim detection with line numbers
- Duplicate URL detection (same URL, different footnote tags)
- Orphaned footnote detection (both directions)
- Per-chapter summary view
- `--fix` mode for auto-standardization of duplicates
- `--dry-run` for previewing fixes before applying
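Duplicate URL detection from the list above can be sketched by grouping footnote tags by the URL in their definition. The function name `duplicate_url_tags` is hypothetical, the sample definitions are invented, and the sketch assumes single-line definitions and considers only the first URL in each one.

```python
import re
from collections import defaultdict

# A definition line and the first URL that appears on it.
DEF_WITH_URL = re.compile(r"^\[\^([\w-]+)\]:.*?(https?://\S+)", re.MULTILINE)

def duplicate_url_tags(content):
    """Map each URL used by more than one footnote tag to those tags."""
    tags_by_url = defaultdict(list)
    for tag, url in DEF_WITH_URL.findall(content):
        # Trim trailing punctuation that \S+ may have swallowed.
        tags_by_url[url.rstrip(".,;)")].append(tag)
    return {url: tags for url, tags in tags_by_url.items() if len(tags) > 1}

refs = (
    "[^tag-a]: Example report, https://example.com/report.\n"
    "[^tag-b]: Same source again: https://example.com/report\n"
    "[^tag-c]: https://example.com/other\n"
)
print(duplicate_url_tags(refs))
# {'https://example.com/report': ['tag-a', 'tag-b']}
```

A `--fix` mode could use this mapping to rewrite later tags to the first one, with `--dry-run` printing the planned replacements instead of writing them.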
Dependencies
- Python 3.8+
- `re` (standard library) -- regex matching
- `pyyaml` -- frontmatter parsing
- `rich` (optional) -- colored terminal output