Automatically generate structured markdown summaries of academic PDFs for use as context in engineering codebases.
- Sophisticated title extraction: PDF metadata (/Title, XMP) → first-page text heuristics → filename fallback
- LLM-based summarization: OpenAI-compatible API with map-reduce strategy for high-quality summaries
- Multiple LLM providers: Supports OpenAI, OpenRouter, Gemini, or any OpenAI-compatible endpoint
- Structured output: TL;DR, Problem, Approach, Results, Practical Takeaways, Limitations
- Incremental processing: Caches extracted text, skips re-extraction for unchanged PDFs
- Customizable prompts: Configure via `prompts.json`
```bash
# Create & activate a virtual environment (recommended)
uv venv

# Windows (PowerShell)
.\.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Required: set up environment variables
cp .env.local .env

# Edit .env and add your OPENAI_API_KEY (required)
```

Environment variables:

- `OPENAI_API_KEY` - Required for LLM summarization
- `OPENAI_MODEL` - Model to use (default: `gpt-5-mini-2025-08-07`)
- `OPENAI_BASE_URL` - API base URL (optional, for OpenRouter, Gemini, etc.)
Customize summarization prompts and chunking via `prompts.json`:
```json
{
  "chunk_prompt": "...",
  "reduce_prompt": "...",
  "chunk_max_chars": 12000,
  "max_chunks": 8
}
```

Available keys:

- `chunk_prompt` - Template for summarizing individual chunks. Placeholders: `{title}`, `{idx}`, `{total}`, `{chunk}`
- `reduce_prompt` - Template for the final combination step. Placeholders: `{title}`, `{summaries}`
- `chunk_max_chars` - Maximum characters per text chunk (default: `12000`)
- `max_chunks` - Maximum number of chunks to process per paper (default: `8`)
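A chunker honoring `chunk_max_chars` and `max_chunks` might look like this. It is a sketch under the assumption that chunks split on paragraph boundaries; the actual implementation may differ:

```python
def chunk_text(text: str, chunk_max_chars: int = 12000, max_chunks: int = 8) -> list[str]:
    """Split text into at most max_chunks pieces of roughly chunk_max_chars each.

    Splits on paragraph boundaries where possible so chunks stay readable
    for the LLM; a single oversized paragraph is kept whole in this sketch.
    """
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks[:max_chunks]   # text beyond max_chunks is dropped
```

Note the truncation at the end: with the defaults, only the first ~96k characters of a paper are summarized.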
```bash
# Basic usage (requires OPENAI_API_KEY in .env or environment)
python summarize_papers.py

# Or set the API key inline
OPENAI_API_KEY=sk-... python summarize_papers.py

# Custom options
python summarize_papers.py --papers-dir papers --out output/PAPERS_SUMMARY.md --max-pages 10

# Force re-summarize all papers (ignore cache)
python summarize_papers.py --no-cache

# Clear cache and re-run
python summarize_papers.py --clear-cache
```

- `--papers-dir DIR` - Directory containing PDFs (default: `papers`)
- `--out FILE` - Output markdown path (default: `output/PAPERS_SUMMARY.md`)
- `--max-pages N` - Limit pages per PDF; 0 = all pages (default: `0`)
- `--no-cache` - Disable caching and re-extract text from all PDFs
- `--clear-cache` - Clear the cache before running
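The flags above map onto a straightforward `argparse` setup; this mirror of the documented interface is illustrative, not the actual parser in `summarize_papers.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI documented above."""
    p = argparse.ArgumentParser(description="Summarize academic PDFs to markdown.")
    p.add_argument("--papers-dir", default="papers",
                   help="Directory containing PDFs")
    p.add_argument("--out", default="output/PAPERS_SUMMARY.md",
                   help="Output markdown path")
    p.add_argument("--max-pages", type=int, default=0,
                   help="Limit pages per PDF; 0 = all pages")
    p.add_argument("--no-cache", action="store_true",
                   help="Disable caching, re-extract text from all PDFs")
    p.add_argument("--clear-cache", action="store_true",
                   help="Clear cache before running")
    return p
```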
```mermaid
graph TD
    A[📁 PDF Directory] --> B[Extract PDFs]
    B --> C[For each PDF]
    subgraph "1. Extraction"
        C --> D{Title Strategy}
        D -->|1. Override| E[Manual Dict]
        D -->|2. Metadata| F[PDF /Title or XMP]
        D -->|3. Heuristic| G[First Page Text]
        D -->|4. Fallback| H[Filename]
        E --> I[Extract Full Text<br/>pdfminer.six]
        F --> I
        G --> I
        H --> I
        I --> J[Clean Text]
    end
    J --> K[Paper Object]
    subgraph "2. Summarization"
        K --> L[Load Config<br/>prompts.json]
        L --> M[Chunk Text]
        M --> N[Map: Summarize Chunks<br/>OpenAI API]
        N --> O[Reduce: Combine<br/>OpenAI API]
    end
    O --> P[Final Summary]
    P --> Q{More PDFs?}
    Q -->|Yes| C
    Q -->|No| R[Build Markdown]
    R --> S[Write Output]
```
Key stages:
- PDF → Paper - Title extraction cascade + text extraction with fallback strategies
- Paper → Summarized Paper - Map-reduce LLM summarization (chunk → summarize → combine)
- Papers → Markdown - Build structured output with index and summaries
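The map-reduce stage can be sketched as below. The helper is hypothetical; `client` is any `openai.OpenAI`-compatible client (left untyped here so the sketch stands alone), and the format placeholders match those documented for `prompts.json`:

```python
def summarize_paper(client, model: str, title: str,
                    chunks: list[str], chunk_prompt: str, reduce_prompt: str) -> str:
    """Map-reduce sketch: summarize each chunk, then combine the summaries."""
    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content or ""

    # Map: one partial summary per chunk
    partials = [
        ask(chunk_prompt.format(title=title, idx=i + 1, total=len(chunks), chunk=chunk))
        for i, chunk in enumerate(chunks)
    ]
    # Reduce: combine partial summaries into the final structured summary
    return ask(reduce_prompt.format(title=title, summaries="\n\n".join(partials)))
```

The reduce step sees only the partial summaries, never the raw text, which is what keeps long papers within the model's context window.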
The codebase follows a deep modules design pattern with strict separation of concerns:
- `lib/pdf_extract.py` - Deep module hiding all PDF parsing complexity
- `lib/text_clean.py` - Pure text transformation functions
- `lib/content_analysis.py` - Pure analysis functions (abstract, DOI, contributions)
- `lib/summarization.py` - LLM-based summarization (OpenAI)
- `lib/cache.py` - Caches extracted text (not summaries) for incremental processing
- `lib/models.py` - Immutable dataclasses (`Paper`, `ExtractedContent`)
- `summarize_papers.py` - Thin orchestration layer
If a PDF's title metadata is corrupt or missing, add an override to `lib/pdf_extract.py`:

```python
TITLE_OVERRIDES: dict[str, str] = {
    "filename.pdf": "Actual Paper Title",
}
```

Generated markdown includes:
- Index of all papers with anchor links
- Per-paper summaries with:
- TL;DR (3 bullets)
- Problem statement
- Approach/methodology
- Results with metrics
- Practical takeaways
- Limitations and open questions
- DOI link (if available)
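Building the index with anchor links can be sketched like this; the function and the GitHub-style slug rule are illustrative assumptions, not the actual output builder:

```python
import re

def build_markdown(papers: list[tuple[str, str]]) -> str:
    """Sketch: index of anchor links followed by per-paper summaries.

    `papers` is a list of (title, summary_markdown) pairs; slugs follow
    GitHub's heading-anchor convention (lowercase, spaces to hyphens).
    """
    def slug(title: str) -> str:
        s = re.sub(r"[^\w\s-]", "", title.lower())
        return re.sub(r"\s+", "-", s.strip())

    lines = ["# Paper Summaries", "", "## Index", ""]
    lines += [f"- [{title}](#{slug(title)})" for title, _ in papers]
    lines.append("")
    for title, summary in papers:
        lines += [f"## {title}", "", summary, ""]
    return "\n".join(lines)
```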
- `pdfminer.six` - PDF text extraction (preferred for two-column layouts)
- `python-dotenv` - Environment variable loading
- `tqdm` - Progress bars
- `openai` - Required for LLM-based summarization
See LICENSE file for details.