EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts complex PDFs into clean, structured JSON, Markdown, or HTML in milliseconds without ML dependencies.

How fast is EdgeParse compared to other PDF parsers?

EdgeParse processes 40+ pages per second — 10 to 100× faster than Python-based alternatives like Docling or Marker. It achieves 0.026s average processing time per document.

What programming languages does EdgeParse support?

EdgeParse provides native bindings for Python (via PyO3) and Node.js (via NAPI-RS), ships a standalone CLI binary, and can be used directly as a Rust library crate.

Does EdgeParse require GPU or ML models?

No. EdgeParse is a rule-based extraction engine with zero ML dependencies. No GPU, no Java, no Poppler, no Tesseract required. Just pip install edgeparse and go.

#1 Non-ML PDF Parser · Leads the current benchmark · 83× faster than Docling · Zero dependencies

The PDF Engine for RAG Pipelines

Best published benchmark score without ML. 83× faster than Docling and 2× faster than OpenDataLoader. Zero GPU, zero OCR, zero JVM — just a 15 MB Rust binary with the best reported scores across reading order, tables, headings, paragraphs, text quality, and speed.

pip install edgeparse

78.1% accuracy

0 ML dependencies

5 SDK languages

Works with

Python · Node.js · Rust · CLI · WebAssembly

One Command. Instant PDF Intelligence.

pip install edgeparse — and your AI stack can read any PDF in milliseconds. No GPU warmup, no model downloads, no infrastructure.

# Step 1: Register the EdgeParse agent skill
npx skills add raphaelmansuy/edgeparse --skill edgeparse

# This adds to skills-lock.json:
# {
#   "version": 1,
#   "skills": {
#     "edgeparse": {
#       "source": "raphaelmansuy/edgeparse",
#       "sourceType": "github"
#     }
#   }
# }

# Step 2: Install the Python runtime
pip install edgeparse

# macOS / Linux — one-time setup
brew tap raphaelmansuy/tap
brew install edgeparse

# Verify installation
edgeparse --version

# Parse a PDF to Markdown
edgeparse report.pdf --format markdown

# Parse to JSON with bounding boxes
edgeparse invoice.pdf --format json

# Batch convert a directory
edgeparse docs/*.pdf --format markdown --output-dir results/

import edgeparse, json

# Convert PDF to Markdown
md = edgeparse.convert("report.pdf", format="markdown")
print(md[:500])

# Parse structured JSON with bounding boxes
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for el in doc["kids"][:3]:
  print(el["type"], el.get("content", "")[:60])

# Save to output file
path = edgeparse.convert_file("report.pdf", output_dir="out/", format="markdown")

# Extract specific pages with table clustering
md = edgeparse.convert("report.pdf", pages="1-5", table_method="cluster")

import { convert, convertFile } from "edgeparse";

// Convert PDF to Markdown
const md = convert("report.pdf", { format: "markdown" });
console.log(md.slice(0, 500));

// Parse structured JSON output
const doc = JSON.parse(convert("invoice.pdf", { format: "json" }));
doc.kids.slice(0, 3).forEach(el => console.log(el.type, el.content ?? ''));

// Extract specific pages
const pages = convert("report.pdf", { format: "markdown", pages: "1-5" });

// Save to output directory
const path = convertFile("report.pdf", { outputDir: "out/", format: "markdown" });

import init, { convert_to_string } from 'edgeparse-wasm';

// Load WASM binary (once)
await init();

// Read PDF from user upload
const bytes = new Uint8Array(await file.arrayBuffer());

// Extract Markdown — runs entirely in the browser
const markdown = convert_to_string(bytes, 'markdown');

// Extract structured JSON
const json = convert_to_string(bytes, 'json');

// Extract HTML
const html = convert_to_string(bytes, 'html');

// Try it live: edgeparse.com/demo/

Features

Everything Your AI Stack Needs From a PDF

EdgeParse is the only PDF parser with ML-level accuracy that runs without ML — in Python, Node.js, the browser, and Rust.

83× Faster Than Docling

0.007 s/doc on Apple M4 Max. About 47× faster than PyMuPDF4LLM and 2× faster than OpenDataLoader, based on the published per-document times. Parallel per-page processing via Rayon, CPU only.

Best-in-Class Table Extraction

TEDS score of 0.559 — best in the current published comparison and 73% better than OpenDataLoader heuristic mode (0.323). Ruling-line + borderless cluster detection with merged cell support.

Multi-Column Reading Order

XY-Cut++ reads multi-column layouts, sidebars, and mixed content in the correct logical order. NID score of 0.885 — highest in the current benchmark snapshot.
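
To make the reading-order idea concrete, here is a minimal sketch of a plain recursive XY-cut in Python. It is not EdgeParse's XY-Cut++ implementation, only the classic technique that family of algorithms builds on. The box layout, the `min_gap` threshold, and the function names are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# (left, bottom, right, top) in PDF points; y grows upward.
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], lo: int, hi: int, min_gap: float) -> Optional[Tuple[List[Box], List[Box]]]:
    """Split boxes at the widest empty gap along one axis, or return None.

    lo/hi are the tuple indices of the axis (0/2 for x, 1/3 for y).
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, cut_at = min_gap, None
    reach = ordered[0][hi]  # furthest extent covered by the boxes seen so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, cut_at = gap, i
        reach = max(reach, ordered[i][hi])
    if cut_at is None:
        return None
    return ordered[:cut_at], ordered[cut_at:]


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Order boxes for reading: columns left to right, rows top to bottom."""
    if len(boxes) <= 1:
        return list(boxes)
    # A gap along x separates columns; read the left part first.
    parts = _split(boxes, 0, 2, min_gap)
    if parts is not None:
        left, right = parts
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    # A gap along y separates rows; the upper part reads first.
    parts = _split(boxes, 1, 3, min_gap)
    if parts is not None:
        lower, upper = parts
        return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)
    # No clean cut: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (-b[3], b[0]))


# A full-width title above two columns, supplied in scrambled order.
title = (50.0, 720.0, 550.0, 760.0)
l1, l2 = (50.0, 400.0, 290.0, 690.0), (50.0, 100.0, 290.0, 385.0)
r1, r2 = (310.0, 450.0, 550.0, 690.0), (310.0, 100.0, 550.0, 430.0)
order = xy_cut([r2, l1, title, l2, r1])
print(order == [title, l1, l2, r1, r2])  # True
```

A plain XY-cut like this one is fragile when row gaps line up across columns or when a wide element blocks every cut, which is the kind of case refined variants aim to handle.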

Full Document Hierarchy

Headings, paragraphs, lists, figures — all classified with nesting. MHS score of 0.554, best among the compared engines in the current release snapshot.
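
As an illustration of what nested classification makes possible downstream, the sketch below rebuilds an outline tree from a flat list of elements. The `type` and `content` keys follow the Python example earlier on this page; the numeric `level` key on headings is a hypothetical field assumed for this example and may not match EdgeParse's actual JSON schema.

```python
from typing import Any, Dict, List


def build_outline(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Nest headings by level and attach body text to the nearest heading."""
    root: Dict[str, Any] = {"level": 0, "title": None, "body": [], "children": []}
    stack = [root]
    for el in elements:
        if el.get("type") == "heading":
            # "level" is an assumed field; clamp to 1 so the root is never popped.
            level = max(1, int(el.get("level", 1)))
            while stack[-1]["level"] >= level:
                stack.pop()
            node = {"level": level, "title": el.get("content", ""), "body": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Text that appears before the first heading lands on the root and is not returned.
            stack[-1]["body"].append(el.get("content", ""))
    return root["children"]


sample = [
    {"type": "heading", "level": 1, "content": "Introduction"},
    {"type": "paragraph", "content": "Why this report exists."},
    {"type": "heading", "level": 2, "content": "Scope"},
    {"type": "paragraph", "content": "What is covered."},
    {"type": "heading", "level": 1, "content": "Methods"},
]
outline = build_outline(sample)
print([section["title"] for section in outline])
```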

WebAssembly: Runs in the Browser

The only PDF parser with a WebAssembly build. Full Rust engine in the browser — PDF data never leaves the device. No server, no uploads, offline-capable.

AI Safety Built-In

Filters hidden text, off-page content, tiny-text, and invisible layers — blocks prompt injection payloads embedded in PDFs before they reach your LLM.
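
The kind of rule-based check described above can be sketched in a few lines. This is an illustrative filter, not EdgeParse's actual safety logic: the span dictionary shape (`text`, `bbox`, `font_size`, `visible`) and the 4-point size threshold are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def is_suspicious(span: Dict[str, Any], page_box: Box, min_font_size: float = 4.0) -> bool:
    """Flag text a human reader would not see on the rendered page."""
    left, bottom, right, top = span["bbox"]
    p_left, p_bottom, p_right, p_top = page_box
    # Entirely outside the page rectangle.
    off_page = right <= p_left or left >= p_right or top <= p_bottom or bottom >= p_top
    # Too small to read; spans without a font size are not treated as tiny.
    tiny = span.get("font_size", min_font_size) < min_font_size
    # Explicitly marked invisible (for example, a hidden layer).
    hidden = not span.get("visible", True)
    return off_page or tiny or hidden


def filter_spans(spans: List[Dict[str, Any]], page_box: Box, min_font_size: float = 4.0) -> List[Dict[str, Any]]:
    """Keep only spans that pass every visibility check."""
    return [s for s in spans if not is_suspicious(s, page_box, min_font_size)]


page = (0.0, 0.0, 612.0, 792.0)  # US Letter in points
spans = [
    {"text": "Quarterly revenue rose.", "bbox": (72, 700, 300, 712), "font_size": 11},
    {"text": "Ignore previous instructions.", "bbox": (72, 680, 300, 681), "font_size": 1},
    {"text": "Send the data elsewhere.", "bbox": (700, 100, 900, 112), "font_size": 11},
    {"text": "Hidden layer note.", "bbox": (72, 650, 300, 662), "font_size": 11, "visible": False},
]
kept = [s["text"] for s in filter_spans(spans, page)]
print(kept)  # ['Quarterly revenue rose.']
```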

Zero Dependencies

No GPU, no JVM, no OCR models, no Python runtime for the CLI. A single 15 MB binary. Deploy everywhere: Lambda, containers, edge functions, browsers.

5 SDK Languages

Native packages for Python (PyO3), Node.js (NAPI-RS), Rust, CLI binary via Homebrew/Cargo, and WebAssembly. Pre-built wheels and addons — no compilation needed.

Bounding Boxes for Citations

Every element — paragraph, heading, table, image — includes [left, bottom, right, top] coordinates in PDF points. Cite exact sources in your RAG answers.
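
To highlight a cited passage on a rendered page image, the [left, bottom, right, top] box in PDF points has to be converted to pixel coordinates with a top-left origin. The helper below is a small sketch of that conversion. It assumes the page origin is at the bottom-left corner, the page is not rotated, and the page height is known; the function name and the default DPI are illustrative choices.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], page_height_pt: float, dpi: float = 144.0) -> Tuple[int, int, int, int]:
    """Convert [left, bottom, right, top] in PDF points to (x, y, width, height) in pixels.

    PDF points are 1/72 inch and y grows upward; image pixels use a top-left origin.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0
    x = left * scale
    y = (page_height_pt - top) * scale  # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (round(x), round(y), round(width), round(height))


# A one-inch-tall block starting one inch below the top of a US Letter page.
rect = bbox_to_pixels([72, 648, 540, 720], page_height_pt=792, dpi=144)
print(rect)  # (144, 144, 936, 144)
```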

#1 Non-ML PDF Parser in Independent Benchmarks

Tested on 200 real-world PDFs — academic papers, financial reports, multi-column layouts, and complex tables. Running on Apple M4 Max.

EdgeParse: 78.1% · 0.007 s/doc
Docling (IBM): 74.5% · 0.584 s/doc
OpenDataLoader: 72.3% · 0.014 s/doc
PyMuPDF4LLM: 71.0% · 0.327 s/doc
LiteParse: 56.4% · 0.160 s/doc
MarkItDown: 56.4% · 0.123 s/doc

Tool	NID	TEDS	MHS	Overall	Speed
EdgeParse	0.885	0.559	0.554	0.781	0.007 s/doc
Docling (IBM)	0.867	0.540	0.438	0.745	0.584 s/doc
OpenDataLoader	0.861	0.323	0.436	0.723	0.014 s/doc
PyMuPDF4LLM	0.852	0.323	0.407	0.710	0.327 s/doc
LiteParse	0.815	0.000	0.001	0.564	0.160 s/doc
MarkItDown	0.807	0.193	0.001	0.564	0.123 s/doc

Benchmark snapshot updated 2026-03-28. EdgeParse leads the current benchmark on every reported quality metric while remaining CPU-only: no OCR models, no GPU, no JVM.
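
The speed multiples quoted on this page can be rechecked from the table itself. The short script below divides each tool's published time by EdgeParse's, and computes the relative TEDS gain over OpenDataLoader. Because the table rounds times to three decimals, the ratios are approximate.

```python
# Seconds per document and TEDS scores, copied from the benchmark table above.
speeds = {
    "EdgeParse": 0.007,
    "Docling (IBM)": 0.584,
    "OpenDataLoader": 0.014,
    "PyMuPDF4LLM": 0.327,
    "LiteParse": 0.160,
    "MarkItDown": 0.123,
}
teds = {"EdgeParse": 0.559, "OpenDataLoader": 0.323}

baseline = speeds["EdgeParse"]
ratios = {tool: seconds / baseline for tool, seconds in speeds.items()}
for tool, ratio in ratios.items():
    print(f"{tool}: {ratio:.1f}x the EdgeParse time per document")

# Relative improvement in table-structure score.
teds_gain = (teds["EdgeParse"] - teds["OpenDataLoader"]) / teds["OpenDataLoader"]
print(f"TEDS gain over OpenDataLoader: {teds_gain:.0%}")
```

With the published figures this gives roughly 83× for Docling, 2× for OpenDataLoader, and a 73% TEDS gain.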

Head-to-Head Comparison

Why Engineers Choose EdgeParse

EdgeParse is the only PDF engine that delivers near-ML accuracy without any ML dependencies — no OCR models, no GPU, no JVM. Just a 15 MB Rust binary.

EdgeParse: 0.781
Overall benchmark score
0.007 s/doc · CPU only
No GPU · No OCR · No JVM · WebAssembly · 5 SDKs

OpenDataLoader: 0.723
Fast heuristic pipeline
0.014 s/doc · 2× slower
Python only · No WASM

IBM Docling: 0.745
Requires OCR / ML stack
0.584 s/doc · 83× slower
Needs OCR · Heavy setup

Feature	EdgeParse (this project)	OpenDataLoader (heuristic)	Docling (IBM)	PyMuPDF4LLM
Overall accuracy	0.781 ✅	0.723	0.745	0.710
Speed (s/doc)	0.007 ✅	0.014	0.584	0.327
Table extraction (TEDS)	0.559 ✅	0.323	0.540	0.323
Reading order (NID)	0.885 ✅	0.861	0.867	0.852
Heading detection (MHS)	0.554 ✅	0.436	0.438	0.407
Dependencies
GPU required	❌ None	❌ None	⚠️ Optional	❌ None
OCR models required	❌ None	⚠️ Optional	✅ Required	❌ None
Binary size	15 MB ✅	~100 MB+	~500 MB+	~20 MB
SDK / Deployment
Python SDK	✅	✅	✅	✅
Node.js / JavaScript SDK	✅	❌	❌	❌
WebAssembly (browser)	✅	❌	❌	❌
Rust native library	✅	❌	❌	❌
CLI binary	✅	❌	❌	❌
Safety & Privacy
Prompt injection protection	✅	✅	❌	❌
In-browser (data never uploaded)	✅ WASM	❌	❌	❌
Deterministic output	✅	✅	❌	✅
Bounding boxes (JSON)	✅	✅	✅	❌

Benchmark: 200 real-world PDFs (academic papers, financial reports, multi-column layouts) on Apple M4 Max. Scores: NID = reading order, TEDS = table structure, MHS = heading hierarchy. Snapshot updated 2026-03-28. EdgeParse leads every reported quality metric in the current published snapshot.

AI Integration

One Engine, Every AI Workflow

EdgeParse sits at the foundation of your AI stack — turning messy PDFs into clean, structured data that LLMs, agents, and RAG pipelines actually understand.

Any PDF → EdgeParse → Structured Data

RAG Pipelines

Feed your vector database clean, hierarchically-chunked data with bounding boxes for source citation. No more garbled embeddings from raw PDF text.

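# Chunk-ready output for your RAG pipeline
import json, edgeparse
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
chunks = [el.get("content", "") for el in doc["kids"]]  # text of each top-level element
embeddings = embed(chunks)  # embed() is a placeholder for your embedding function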

Embeddings Citations LangChain LlamaIndex
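
One way to turn parsed elements into citation-ready chunks is to group body text under its nearest heading and keep each element's box alongside the text. The sketch below works on a plain list of dictionaries. The `type` and `content` keys follow the Python example earlier on this page, while the `bbox` key name, the lowercase `"heading"` type value, and the `max_chars` limit are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def chunk_by_heading(elements: List[Dict[str, Any]], max_chars: int = 1000) -> List[Dict[str, Any]]:
    """Group body text under the nearest heading, splitting when a chunk grows too long."""
    chunks: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    for el in elements:
        text = el.get("content", "")
        if not text:
            continue
        is_heading = el.get("type") == "heading"
        too_long = (
            current is not None
            and bool(current["text"])
            and len(current["text"]) + len(text) + 1 > max_chars
        )
        if current is None or is_heading or too_long:
            # A split chunk inherits the heading of the chunk it continues.
            title = text if is_heading else (current["heading"] if current else None)
            current = {"heading": title, "text": "", "sources": []}
            chunks.append(current)
        if not is_heading:
            current["text"] = (current["text"] + "\n" + text).strip()
            current["sources"].append(el.get("bbox"))  # "bbox" is an assumed key name
    # Drop headings that had no body text under them.
    return [c for c in chunks if c["text"]]


elements = [
    {"type": "heading", "content": "Revenue"},
    {"type": "paragraph", "content": "Revenue grew.", "bbox": [72, 700, 540, 712]},
    {"type": "paragraph", "content": "Costs fell.", "bbox": [72, 680, 540, 692]},
    {"type": "heading", "content": "Outlook"},
    {"type": "paragraph", "content": "Stable.", "bbox": [72, 640, 540, 652]},
]
chunks = chunk_by_heading(elements)
for chunk in chunks:
    print(chunk["heading"], "|", chunk["text"].replace("\n", " "), "|", chunk["sources"])
```

Each chunk's `sources` list can be stored as metadata next to its embedding so an answer can point back to the exact region of the page.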

AI Agents

Give your AI agents the ability to read, understand, and reason over any PDF document. Structured extraction means reliable tool use — no hallucinations.

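# Agent tool: extract PDF intelligence
import json, edgeparse

@tool("read_pdf")  # tool decorator supplied by your agent framework
def read_pdf(path: str) -> dict:
    # convert() returns a JSON string, so parse it to match the dict return type
    return json.loads(edgeparse.convert(path, format="json"))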

Tool Use CrewAI AutoGen OpenAI

Copilot Skills

Build custom Copilot Skills and MCP servers that give AI assistants deep PDF understanding. Extract tables, headings, and metadata on demand.

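# MCP server tool definition
import edgeparse

@server.tool("extract_pdf")  # server is your MCP server instance
async def extract(uri: str) -> str:
    # uri is passed to convert() as a file path
    return edgeparse.convert(uri, format="markdown")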

MCP Copilot ChatGPT Claude

Use Cases

Built for Real Production Workloads

Teams building RAG pipelines, legal tech, financial analysis, and browser apps choose EdgeParse for its speed, accuracy, and zero-dependency deployment.

RAG & Vector Search

Feed your vector database perfectly structured, hierarchical chunks with bounding boxes for source citation. Higher retrieval quality, better LLM answers.

LangChain · LlamaIndex · Embeddings · Citations

Legal & Compliance

Extract clauses, tables, and signature blocks from contracts and regulatory filings. Deterministic output means no surprises in production.

Contracts · Compliance · Audit Trail

Financial Reports

Parse earnings reports, balance sheets, and SEC filings with accurate table extraction (TEDS 0.559) — columns, merged cells, and nested headers intact.

SEC Filings · Earnings · Tables · JSON

Research & Academic

Extract papers with correct multi-column reading order (NID 0.885) — figures, citations, and section hierarchy preserved for downstream analysis.

arXiv · Multi-column · Citations

In-Browser Apps (WASM)

The only PDF parser with WebAssembly support. Full extraction in the browser — no server, no uploads, privacy by design. Works offline after first load.

WebAssembly · Privacy · Offline · React/Vue

Healthcare & Life Sciences

Process clinical notes, drug labels, and research protocols with AI safety filters that block prompt injection attacks embedded in uploaded PDFs.

HIPAA · Safety · Structured Data

Start Parsing PDFs in 30 Seconds

No API key. No cloud account. No GPU. Just install and parse.

pip install edgeparse

Need enterprise deployment? Visit the Enterprise page or contact us for architecture reviews and production rollouts.