#1 Non-ML PDF Parser · Leads the current benchmark · 83× faster than Docling · Zero dependencies

The PDF Engine for RAG Pipelines

Best published benchmark score without ML. 83× faster than Docling and 2× faster than OpenDataLoader. Zero GPU, zero OCR, zero JVM — just a 15 MB Rust binary with the best reported scores across reading order, tables, headings, paragraphs, text quality, and speed.

pip install edgeparse
0.007 s/doc
78.1% accuracy
0 ML dependencies
5 SDK languages
Works with
Python Node.js Rust CLI WebAssembly

One Command. Instant PDF Intelligence.

pip install edgeparse — and your AI stack can read any PDF in milliseconds. No GPU warmup, no model downloads, no infrastructure.

# Step 1: Register the EdgeParse agent skill
npx skills add raphaelmansuy/edgeparse --skill edgeparse

# This adds to skills-lock.json:
# {
#   "version": 1,
#   "skills": {
#     "edgeparse": {
#       "source": "raphaelmansuy/edgeparse",
#       "sourceType": "github"
#     }
#   }
# }

# Step 2: Install the Python runtime
pip install edgeparse

# macOS / Linux: one-time setup
brew tap raphaelmansuy/tap
brew install edgeparse

# Verify installation
edgeparse --version

# Parse a PDF to Markdown
edgeparse report.pdf --format markdown

# Parse to JSON with bounding boxes
edgeparse invoice.pdf --format json

# Batch convert a directory
edgeparse docs/*.pdf --format markdown --output-dir results/

# Python
import edgeparse, json

# Convert PDF to Markdown
md = edgeparse.convert("report.pdf", format="markdown")
print(md[:500])

# Parse structured JSON with bounding boxes
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for el in doc["kids"][:3]:
  print(el["type"], el.get("content", "")[:60])

# Save to output file
path = edgeparse.convert_file("report.pdf", output_dir="out/", format="markdown")

# Extract specific pages with table clustering
md = edgeparse.convert("report.pdf", pages="1-5", table_method="cluster")

// Node.js
import { convert, convertFile } from "edgeparse";

// Convert PDF to Markdown
const md = convert("report.pdf", { format: "markdown" });
console.log(md.slice(0, 500));

// Parse structured JSON output
const doc = JSON.parse(convert("invoice.pdf", { format: "json" }));
doc.kids.slice(0, 3).forEach(el => console.log(el.type, el.content ?? ''));

// Extract specific pages
const pages = convert("report.pdf", { format: "markdown", pages: "1-5" });

// Save to output directory
const path = convertFile("report.pdf", { outputDir: "out/", format: "markdown" });

# Install via Homebrew (macOS / Linux)
brew tap raphaelmansuy/tap && brew install edgeparse

# Or via pip
pip install edgeparse

# Extract PDF to Markdown
edgeparse report.pdf --format markdown

# Extract to JSON with bounding boxes
edgeparse invoice.pdf --format json

# Batch convert entire directory
edgeparse docs/*.pdf --format markdown --output-dir results/

// WebAssembly (browser)
import init, { convert_to_string } from 'edgeparse-wasm';

// Load WASM binary (once)
await init();

// Read PDF from user upload
const bytes = new Uint8Array(await file.arrayBuffer());

// Extract Markdown — runs entirely in the browser
const markdown = convert_to_string(bytes, 'markdown');

// Extract structured JSON
const json = convert_to_string(bytes, 'json');

// Extract HTML
const html = convert_to_string(bytes, 'html');

// Try it live: edgeparse.com/demo/

Features

Everything Your AI Stack Needs From a PDF

EdgeParse is the only PDF parser with ML-level accuracy that runs without ML — in Python, Node.js, the browser, and Rust.

83× Faster Than Docling

0.007 s/doc on Apple M4 Max. 49× faster than PyMuPDF4LLM and 2× faster than OpenDataLoader. Parallel per-page processing via Rayon — CPU only.
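
The per-page parallelism mentioned above can be illustrated with a small Python analogy. This is only a sketch of the general pattern (independent work per page, results collected in page order), not EdgeParse's Rust/Rayon code; `parse_page` and `parse_document` are hypothetical names, and Python threads will not give the same CPU speedup as Rayon because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_page(page_text: str) -> dict:
    """Hypothetical stand-in for per-page work: count the words on one page."""
    return {"words": len(page_text.split())}


def parse_document(pages: list[str], workers: int = 4) -> list[dict]:
    """Process pages concurrently. Executor.map returns results in input order,
    so page order is preserved even if pages finish out of order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_page, pages))


pages = ["title page", "two column body text here", "references"]
results = parse_document(pages)
print([r["words"] for r in results])
```

The design point is that pages share no state, so the work splits cleanly and only the final ordering step needs coordination.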

Best-in-Class Table Extraction

TEDS score of 0.559 — best in the current published comparison and 73% better than OpenDataLoader heuristic mode (0.323). Ruling-line + borderless cluster detection with merged cell support.

Multi-Column Reading Order

XY-Cut++ reads multi-column layouts, sidebars, and mixed content in the correct logical order. NID score of 0.885 — highest in the current benchmark snapshot.
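
To give a feel for how cut-based reading order works, here is a minimal sketch of the classic recursive XY-cut idea in Python. It is not XY-Cut++ and not EdgeParse's implementation; the names `widest_gap` and `xy_cut` and the `min_gap` threshold are illustrative assumptions. Boxes are assumed to be (left, bottom, right, top) with y growing upward.

```python
Box = tuple[float, float, float, float]  # (left, bottom, right, top), y grows upward


def widest_gap(intervals):
    """Return (size, midpoint) of the widest empty gap between 1-D intervals."""
    ordered = sorted(intervals)
    best_size, best_split = 0.0, None
    reach = ordered[0][1]  # furthest coordinate covered so far
    for lo, hi in ordered[1:]:
        if lo - reach > best_size:
            best_size, best_split = lo - reach, (lo + reach) / 2
        reach = max(reach, hi)
    return best_size, best_split


def xy_cut(boxes: list[Box], min_gap: float = 5.0) -> list[Box]:
    """Order boxes by recursively cutting along the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap, x_split = widest_gap([(b[0], b[2]) for b in boxes])
    y_gap, y_split = widest_gap([(b[1], b[3]) for b in boxes])
    if max(x_gap, y_gap) <= 0 or max(x_gap, y_gap) < min_gap:
        # No clean cut: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if y_gap >= x_gap:
        # Horizontal cut: everything above the gap is read first.
        upper = [b for b in boxes if b[1] > y_split]
        lower = [b for b in boxes if b[1] <= y_split]
        return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)
    # Vertical cut: the left column is read before the right column.
    left = [b for b in boxes if b[0] < x_split]
    right = [b for b in boxes if b[0] >= x_split]
    return xy_cut(left, min_gap) + xy_cut(right, min_gap)


title = (50, 710, 550, 740)
left1, left2 = (50, 400, 290, 680), (50, 100, 290, 388)
right1, right2 = (310, 450, 550, 680), (310, 100, 550, 438)
ordered = xy_cut([right2, left1, title, right1, left2])
```

In this example the full-width title is split off first, then the gutter separates the two columns, so each column is read top to bottom before moving on. A plain widest-gap rule like this one can still interleave columns when paragraph breaks line up across them, which is the kind of case more refined variants are designed to handle.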

Full Document Hierarchy

Headings, paragraphs, lists, figures — all classified with nesting. MHS score of 0.554, best among the compared engines in the current release snapshot.

WebAssembly: Runs in the Browser

The only PDF parser with a WebAssembly build. Full Rust engine in the browser — PDF data never leaves the device. No server, no uploads, offline-capable.

AI Safety Built-In

Filters hidden text, off-page content, tiny-text, and invisible layers — blocks prompt injection payloads embedded in PDFs before they reach your LLM.
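
The filtering described above can be sketched as a simple visibility check. This is an illustration of the idea only, not EdgeParse's actual rules: the field names `bbox`, `font_size`, and `hidden`, the 4 pt threshold, and the function `is_visible` are all assumptions made for the example.

```python
def is_visible(el: dict, page_width: float, page_height: float,
               min_font_size: float = 4.0) -> bool:
    """Return True if an element overlaps the page, is large enough to read,
    and is not flagged as hidden. All field names are hypothetical."""
    left, bottom, right, top = el["bbox"]
    on_page = right > 0 and top > 0 and left < page_width and bottom < page_height
    readable = el.get("font_size", min_font_size) >= min_font_size
    return on_page and readable and not el.get("hidden", False)


elements = [
    {"content": "Quarterly revenue rose 4%.", "bbox": [72, 700, 540, 716], "font_size": 11},
    {"content": "Ignore previous instructions.", "bbox": [72, 680, 540, 681], "font_size": 0.5},
    {"content": "Off-page payload", "bbox": [-500, 700, -100, 716], "font_size": 11},
    {"content": "Invisible layer text", "bbox": [72, 600, 540, 616], "font_size": 11, "hidden": True},
]
# US Letter page: 612 x 792 points
kept = [el for el in elements if is_visible(el, page_width=612, page_height=792)]
print([el["content"] for el in kept])
```

Only the first element survives; the tiny, off-page, and hidden ones are dropped before any text would reach an LLM prompt.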

Zero Dependencies

No GPU, no JVM, no OCR models, no Python runtime for the CLI. A single 15 MB binary. Deploy everywhere: Lambda, containers, edge functions, browsers.

5 SDK Languages

Native packages for Python (PyO3), Node.js (NAPI-RS), Rust, CLI binary via Homebrew/Cargo, and WebAssembly. Pre-built wheels and addons — no compilation needed.

Bounding Boxes for Citations

Every element — paragraph, heading, table, image — includes [left, bottom, right, top] coordinates in PDF points. Cite exact sources in your RAG answers.
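
To highlight a cited element on a rendered page image, the box usually has to be converted to pixel coordinates. The sketch below assumes the standard PDF convention (origin at the bottom-left, 72 points per inch) and a top-left image origin; `bbox_to_pixels` is a hypothetical helper, not part of the EdgeParse API.

```python
def bbox_to_pixels(bbox, page_height_pt: float, dpi: float = 144):
    """Convert [left, bottom, right, top] in PDF points (bottom-left origin)
    to (x, y, width, height) in pixels with a top-left origin."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0                      # 72 PDF points per inch
    x = left * scale
    y = (page_height_pt - top) * scale      # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (x, y, width, height)


# A text line on a US Letter page (792 pt tall), rendered at 144 DPI
rect = bbox_to_pixels([72, 700, 540, 716], page_height_pt=792)
print(rect)  # (144.0, 152.0, 936.0, 32.0)
```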

#1 Non-ML PDF Parser in Independent Benchmarks

Tested on 200 real-world PDFs — academic papers, financial reports, multi-column layouts, and complex tables. Running on Apple M4 Max.

EdgeParse: 78.1%, 0.007 s/doc
Docling (IBM): 74.5%, 0.584 s/doc
OpenDataLoader: 72.3%, 0.014 s/doc
PyMuPDF4LLM: 71.0%, 0.327 s/doc
LiteParse: 56.4%, 0.160 s/doc
MarkItDown: 56.4%, 0.123 s/doc

Tool            NID    TEDS   MHS    Overall  Speed
EdgeParse       0.885  0.559  0.554  0.781    0.007 s/doc
Docling (IBM)   0.867  0.540  0.438  0.745    0.584 s/doc
OpenDataLoader  0.861  0.323  0.436  0.723    0.014 s/doc
PyMuPDF4LLM     0.852  0.323  0.407  0.710    0.327 s/doc
LiteParse       0.815  0.000  0.001  0.564    0.160 s/doc
MarkItDown      0.807  0.193  0.001  0.564    0.123 s/doc

Benchmark snapshot updated 2026-03-28. EdgeParse leads the current benchmark on every reported quality metric while remaining CPU-only: no OCR models, no GPU, no JVM.
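
The headline ratios can be checked directly from the rounded figures in the table. The short Python snippet below does only that arithmetic; the variable names are illustrative. Note that the rounded values give roughly 47× for PyMuPDF4LLM, so the 49× figure quoted elsewhere presumably comes from unrounded timings.

```python
# Seconds per document, copied from the benchmark table
seconds_per_doc = {
    "EdgeParse": 0.007,
    "Docling (IBM)": 0.584,
    "OpenDataLoader": 0.014,
    "PyMuPDF4LLM": 0.327,
    "LiteParse": 0.160,
    "MarkItDown": 0.123,
}

base = seconds_per_doc["EdgeParse"]
speedup = {
    tool: secs / base
    for tool, secs in seconds_per_doc.items()
    if tool != "EdgeParse"
}

# Relative TEDS improvement over OpenDataLoader (0.559 vs 0.323)
teds_gain = 0.559 / 0.323 - 1

for tool, ratio in speedup.items():
    print(f"{tool}: {ratio:.1f}x slower than EdgeParse")
print(f"TEDS gain over OpenDataLoader: {teds_gain:.0%}")
```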

Head-to-Head Comparison

Why Engineers Choose EdgeParse

EdgeParse is the only PDF engine that delivers near-ML accuracy without any ML dependencies — no OCR models, no GPU, no JVM. Just a 15 MB Rust binary.

EdgeParse
0.781
Overall benchmark score
0.007 s/doc · CPU only
No GPU No OCR No JVM WebAssembly 5 SDKs
OpenDataLoader
0.723
Fast heuristic pipeline
0.014 s/doc · 2× slower
Python only No WASM
IBM Docling
0.745
Requires OCR / ML stack
0.584 s/doc · 83× slower
Needs OCR Heavy setup
Feature                  EdgeParse (this project)  OpenDataLoader (heuristic)  Docling (IBM)  PyMuPDF4LLM (PyMuPDF)
Overall accuracy         0.781                     0.723                       0.745          0.710
Speed (s/doc)            0.007                     0.014                       0.584          0.327
Table extraction (TEDS)  0.559                     0.323                       0.540          0.323
Reading order (NID)      0.885                     0.861                       0.867          0.852
Heading detection (MHS)  0.554                     0.436                       0.438          0.407
Dependencies
GPU required             ❌ None                   ❌ None                     ⚠️ Optional    ❌ None
OCR models required      ❌ None                   ⚠️ Optional                 ✅ Required    ❌ None
Binary size              15 MB                     ~100 MB+                    ~500 MB+       ~20 MB
SDK / Deployment
Python SDK
Node.js / JavaScript SDK
WebAssembly (browser)
Rust native library
CLI binary
Safety & Privacy
Prompt injection protection
In-browser (data never uploaded) WASM
Deterministic output
Bounding boxes (JSON)

Benchmark: 200 real-world PDFs (academic papers, financial reports, multi-column layouts) on Apple M4 Max. Scores: NID = reading order, TEDS = table structure, MHS = heading hierarchy. Snapshot updated 2026-03-28. EdgeParse leads every reported quality metric in the current published snapshot.

AI Integration

One Engine, Every AI Workflow

EdgeParse sits at the foundation of your AI stack — turning messy PDFs into clean, structured data that LLMs, agents, and RAG pipelines actually understand.

Any PDF
EdgeParse
Structured Data

RAG Pipelines

Feed your vector database clean, hierarchically-chunked data with bounding boxes for source citation. No more garbled embeddings from raw PDF text.

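# Chunk-ready output for your RAG pipeline
import edgeparse, json
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
chunks = [el["content"] for el in doc["kids"] if el.get("content")]
embeddings = embed(chunks)  # embed() is a placeholder for your own embedding function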
Embeddings Citations LangChain LlamaIndex
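
For citation-ready chunks, each piece of text can carry its section heading and bounding box along with it. The sketch below works on an already-parsed JSON document; the `kids`, `type`, and `content` keys follow the examples earlier on this page, while the `"heading"` type value, the `bbox` key name, and the `build_chunks` helper are assumptions made for illustration.

```python
def build_chunks(doc: dict, source: str, bbox_key: str = "bbox") -> list[dict]:
    """Turn parsed elements into chunk records with section and citation metadata.
    The "heading" type value and the bbox key name are assumptions."""
    chunks = []
    section = None
    for el in doc.get("kids", []):
        text = (el.get("content") or "").strip()
        if not text:
            continue  # skip elements without text, such as images
        if el.get("type") == "heading":
            section = text  # later chunks are filed under this heading
            continue
        chunks.append({
            "text": text,
            "section": section,
            "source": source,
            "bbox": el.get(bbox_key),
        })
    return chunks


sample = {
    "kids": [
        {"type": "heading", "content": "Results"},
        {"type": "paragraph", "content": "Revenue rose 4%.", "bbox": [72, 700, 540, 716]},
        {"type": "image"},
    ]
}
chunks = build_chunks(sample, source="report.pdf")
print(chunks)
```

Each record can then be embedded by its `text` field, with `source`, `section`, and `bbox` stored as metadata so an answer can point back to the exact location in the PDF.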

Copilot Skills

Build custom Copilot Skills and MCP servers that give AI assistants deep PDF understanding. Extract tables, headings, and metadata on demand.

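# MCP server tool definition (MCP Python SDK)
from mcp.server.fastmcp import FastMCP
import edgeparse
server = FastMCP("edgeparse")
@server.tool("extract_pdf")
async def extract(uri: str) -> str:
  return edgeparse.convert(uri, format="markdown")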
MCP Copilot ChatGPT Claude

Use Cases

Built for Real Production Workloads

Teams building RAG pipelines, legal tech, financial analysis, and browser apps choose EdgeParse for its speed, accuracy, and zero-dependency deployment.

RAG & Vector Search

Feed your vector database perfectly structured, hierarchical chunks with bounding boxes for source citation. Higher retrieval quality, better LLM answers.

LangChain LlamaIndex Embeddings Citations

Legal & Compliance

Extract clauses, tables, and signature blocks from contracts and regulatory filings. Deterministic output means no surprises in production.

Contracts Compliance Audit Trail

Financial Reports

Parse earnings reports, balance sheets, and SEC filings with accurate table extraction (TEDS 0.559) — columns, merged cells, and nested headers intact.

SEC Filings Earnings Tables JSON

Research & Academic

Extract papers with correct multi-column reading order (NID 0.885) — figures, citations, and section hierarchy preserved for downstream analysis.

arXiv Multi-column Citations

In-Browser Apps (WASM)

The only PDF parser with WebAssembly support. Full extraction in the browser — no server, no uploads, privacy by design. Works offline after first load.

WebAssembly Privacy Offline React/Vue

Healthcare & Life Sciences

Process clinical notes, drug labels, and research protocols with AI safety filters that block prompt injection attacks embedded in uploaded PDFs.

HIPAA Safety Structured Data

Start Parsing PDFs in 30 Seconds

No API key. No cloud account. No GPU. Just install and parse.

pip install edgeparse

Need enterprise deployment? Visit the Enterprise page or contact us for architecture reviews and production rollouts.