The PDF Engine for RAG Pipelines
Best published benchmark score without ML. 83× faster than Docling and 2× faster than OpenDataLoader. Zero GPU, zero OCR, zero JVM — just a 15 MB Rust binary with the best reported scores across reading order, tables, headings, paragraphs, text quality, and speed.
One Command. Instant PDF Intelligence.
pip install edgeparse — and your AI stack can read any PDF in milliseconds. No GPU warmup, no model downloads, no infrastructure.
# Step 1: Register the EdgeParse agent skill
npx skills add raphaelmansuy/edgeparse --skill edgeparse
# This adds to skills-lock.json:
# {
# "version": 1,
# "skills": {
# "edgeparse": {
# "source": "raphaelmansuy/edgeparse",
# "sourceType": "github"
# }
# }
# }
# Step 2: Install the Python runtime
pip install edgeparse # macOS / Linux — one-time setup
brew tap raphaelmansuy/tap
brew install edgeparse
# Verify installation
edgeparse --version
# Parse a PDF to Markdown
edgeparse report.pdf --format markdown
# Parse to JSON with bounding boxes
edgeparse invoice.pdf --format json
# Batch convert a directory
edgeparse docs/*.pdf --format markdown --output-dir results/ import edgeparse, json
# Convert PDF to Markdown
md = edgeparse.convert("report.pdf", format="markdown")
print(md[:500])
# Parse structured JSON with bounding boxes
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for el in doc["kids"][:3]:
print(el["type"], el.get("content", "")[:60])
# Save to output file
path = edgeparse.convert_file("report.pdf", output_dir="out/", format="markdown")
# Extract specific pages with table clustering
md = edgeparse.convert("report.pdf", pages="1-5", table_method="cluster") import { convert, convertFile } from "edgeparse";
// Convert PDF to Markdown
const md = convert("report.pdf", { format: "markdown" });
console.log(md.slice(0, 500));
// Parse structured JSON output
const doc = JSON.parse(convert("invoice.pdf", { format: "json" }));
doc.kids.slice(0, 3).forEach(el => console.log(el.type, el.content ?? ''));
// Extract specific pages
const pages = convert("report.pdf", { format: "markdown", pages: "1-5" });
// Save to output directory
const path = convertFile("report.pdf", { outputDir: "out/", format: "markdown" }); # Install via Homebrew (macOS / Linux)
brew tap raphaelmansuy/tap && brew install edgeparse
# Or via pip
pip install edgeparse
# Extract PDF to Markdown
edgeparse report.pdf --format markdown
# Extract to JSON with bounding boxes
edgeparse invoice.pdf --format json
# Batch convert entire directory
edgeparse docs/*.pdf --format markdown --output-dir results/ import init, { convert_to_string } from 'edgeparse-wasm';
// Load WASM binary (once)
await init();
// Read PDF from user upload
const bytes = new Uint8Array(await file.arrayBuffer());
// Extract Markdown — runs entirely in the browser
const markdown = convert_to_string(bytes, 'markdown');
// Extract structured JSON
const json = convert_to_string(bytes, 'json');
// Extract HTML
const html = convert_to_string(bytes, 'html');
// Try it live: edgeparse.com/demo/ Everything Your AI Stack Needs From a PDF
EdgeParse is the only PDF parser with ML-level accuracy that runs without ML — in Python, Node.js, the browser, and Rust.
83× Faster Than Docling
0.007 s/doc on Apple M4 Max. 49× faster than PyMuPDF4LLM and 2× faster than OpenDataLoader. Parallel per-page processing via Rayon — CPU only.
Best-in-Class Table Extraction
TEDS score of 0.559 — best in the current published comparison and 73% better than OpenDataLoader heuristic mode (0.323). Ruling-line + borderless cluster detection with merged cell support.
Multi-Column Reading Order
XY-Cut++ reads multi-column layouts, sidebars, and mixed content in the correct logical order. NID score of 0.885 — highest in the current benchmark snapshot.
Full Document Hierarchy
Headings, paragraphs, lists, figures — all classified with nesting. MHS score of 0.554, best among the compared engines in the current release snapshot.
WebAssembly: Runs in the Browser
The only PDF parser with a WebAssembly build. Full Rust engine in the browser — PDF data never leaves the device. No server, no uploads, offline-capable.
AI Safety Built-In
Filters hidden text, off-page content, tiny-text, and invisible layers — blocks prompt injection payloads embedded in PDFs before they reach your LLM.
Zero Dependencies
No GPU, no JVM, no OCR models, no Python runtime for the CLI. A single 15 MB binary. Deploy everywhere: Lambda, containers, edge functions, browsers.
5 SDK Languages
Native packages for Python (PyO3), Node.js (NAPI-RS), Rust, CLI binary via Homebrew/Cargo, and WebAssembly. Pre-built wheels and addons — no compilation needed.
Bounding Boxes for Citations
Every element — paragraph, heading, table, image — includes [left, bottom, right, top] coordinates in PDF points. Cite exact sources in your RAG answers.
#1 Non-ML PDF Parser in Independent Benchmarks
Tested on 200 real-world PDFs — academic papers, financial reports, multi-column layouts, and complex tables. Running on Apple M4 Max.
| Tool | NID | TEDS | MHS | Overall | Speed |
|---|---|---|---|---|---|
| EdgeParse | 0.885 | 0.559 | 0.554 | 0.781 | 0.007 s/doc |
| Docling (IBM) | 0.867 | 0.540 | 0.438 | 0.745 | 0.584 s/doc |
| OpenDataLoader | 0.861 | 0.323 | 0.436 | 0.723 | 0.014 s/doc |
| PyMuPDF4LLM | 0.852 | 0.323 | 0.407 | 0.710 | 0.327 s/doc |
| LiteParse | 0.815 | 0.000 | 0.001 | 0.564 | 0.160 s/doc |
| MarkItDown | 0.807 | 0.193 | 0.001 | 0.564 | 0.123 s/doc |
Benchmark snapshot updated 2026-03-28. EdgeParse leads the current benchmark on every reported quality metric while remaining CPU-only: no OCR models, no GPU, no JVM.
Why Engineers Choose EdgeParse
EdgeParse is the only PDF engine that delivers near-ML accuracy without any ML dependencies — no OCR models, no GPU, no JVM. Just a 15 MB Rust binary.
| Feature | EdgeParse This project | OpenDataLoader Heuristic | Docling IBM | PyMuPDF4LLM PyMuPDF |
|---|---|---|---|---|
| Overall accuracy | 0.781 ✅ | 0.723 | 0.745 | 0.710 |
| Speed (s/doc) | 0.007 ✅ | 0.014 | 0.584 | 0.327 |
| Table extraction (TEDS) | 0.559 ✅ | 0.323 | 0.540 | 0.323 |
| Reading order (NID) | 0.885 ✅ | 0.861 | 0.867 | 0.852 |
| Heading detection (MHS) | 0.554 ✅ | 0.436 | 0.438 | 0.407 |
| Dependencies | ||||
| GPU required | ❌ None | ❌ None | ⚠️ Optional | ❌ None |
| OCR models required | ❌ None | ⚠️ Optional | ✅ Required | ❌ None |
| Binary size | 15 MB ✅ | ~100 MB+ | ~500 MB+ | ~20 MB |
| SDK / Deployment | ||||
| Python SDK | ✅ | ✅ | ✅ | ✅ |
| Node.js / JavaScript SDK | ✅ | ❌ | ❌ | ❌ |
| WebAssembly (browser) | ✅ | ❌ | ❌ | ❌ |
| Rust native library | ✅ | ❌ | ❌ | ❌ |
| CLI binary | ✅ | ❌ | ❌ | ❌ |
| Safety & Privacy | ||||
| Prompt injection protection | ✅ | ✅ | ❌ | ❌ |
| In-browser (data never uploaded) | ✅ WASM | ❌ | ❌ | ❌ |
| Deterministic output | ✅ | ✅ | ❌ | ✅ |
| Bounding boxes (JSON) | ✅ | ✅ | ✅ | ❌ |
Benchmark: 200 real-world PDFs (academic papers, financial reports, multi-column layouts) on Apple M4 Max. Scores: NID = reading order, TEDS = table structure, MHS = heading hierarchy. Snapshot updated 2026-03-28. EdgeParse leads every reported quality metric in the current published snapshot. Full methodology →
One Engine, Every AI Workflow
EdgeParse sits at the foundation of your AI stack — turning messy PDFs into clean, structured data that LLMs, agents, and RAG pipelines actually understand.
RAG Pipelines
Feed your vector database clean, hierarchically-chunked data with bounding boxes for source citation. No more garbled embeddings from raw PDF text.
# Chunk-ready output for your RAG pipeline
chunks = edgeparse.convert("report.pdf", format="json")
embeddings = embed(chunks) # Clean structured data AI Agents
Give your AI agents the ability to read, understand, and reason over any PDF document. Structured extraction means reliable tool use — no hallucinations.
# Agent tool: extract PDF intelligence
@tool("read_pdf")
def read_pdf(path: str) -> dict:
return edgeparse.convert(path, format="json") Copilot Skills
Build custom Copilot Skills and MCP servers that give AI assistants deep PDF understanding. Extract tables, headings, and metadata on demand.
# MCP server tool definition
@server.tool("extract_pdf")
async def extract(uri: str) -> str:
return edgeparse.convert(uri, format="md") Built for Real Production Workloads
Teams building RAG pipelines, legal tech, financial analysis, and browser apps choose EdgeParse for its speed, accuracy, and zero-dependency deployment.
RAG & Vector Search
Feed your vector database perfectly structured, hierarchical chunks with bounding boxes for source citation. Higher retrieval quality, better LLM answers.
Learn moreLegal & Compliance
Extract clauses, tables, and signature blocks from contracts and regulatory filings. Deterministic output means no surprises in production.
Financial Reports
Parse earnings reports, balance sheets, and SEC filings with accurate table extraction (TEDS 0.559) — columns, merged cells, and nested headers intact.
Research & Academic
Extract papers with correct multi-column reading order (NID 0.885) — figures, citations, and section hierarchy preserved for downstream analysis.
In-Browser Apps (WASM)
The only PDF parser with WebAssembly support. Full extraction in the browser — no server, no uploads, privacy by design. Works offline after first load.
Healthcare & Life Sciences
Process clinical notes, drug labels, and research protocols with AI safety filters that block prompt injection attacks embedded in uploaded PDFs.
Start Parsing PDFs in 30 Seconds
No API key. No cloud account. No GPU. Just install and parse.
pip install edgeparse Need enterprise deployment? Visit the Enterprise page or contact us for architecture reviews and production rollouts.