A simple local tool to convert PDF files into Obsidian-friendly Markdown files using EasyOCR or Docling engines with intelligent document date extraction.
- Inbox processing: `inbox/` -> conversion -> `processed/`; writes the `.md` next to the `.pdf`, and on failure moves the PDF to `failed/`.
- Automatic language detection between English and Hungarian via a 1-page probe, enhanced with accent- and stopword-based scoring.
- Image-only PDF detection with configurable thresholds and default skip behavior (no Markdown created, file moved and logged).
- Two engines: EasyOCR + pdf2image (Poppler) at 300 DPI, or Docling for advanced PDF understanding.
- Plain-text logging in `logs/run.log` and resumable processing via `--resume` and `logs/state.json`.
- macOS
- Homebrew Poppler (for `pdf2image`, only needed with the EasyOCR engine):
brew install poppler
- Python 3.10+ recommended
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you need a specific torch build (GPU/CUDA), refer to the PyTorch install instructions. On Apple Silicon, EasyOCR runs on CPU; CUDA is not available.
- `inbox/`: place your input PDFs here
- `processed/`: processed originals and generated Markdown files (paired by basename)
- `failed/`: PDFs that could not be processed (e.g., image-only per thresholds or runtime error)
- `logs/`: run logs and `state.json`

These will be created on first run if missing.
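
As a minimal sketch of the "created on first run" behavior, the snippet below creates the four folders idempotently. The helper name `ensure_work_dirs` is hypothetical and is not taken from `pdf_to_markdown.py`; only the folder names come from this README.

```python
from pathlib import Path

# Folder names follow this README; the helper itself is a hypothetical illustration.
WORK_DIRS = ("inbox", "processed", "failed", "logs")


def ensure_work_dirs(root: Path) -> dict[str, Path]:
    """Create the working folders under root if they are missing and return their paths."""
    paths = {name: root / name for name in WORK_DIRS}
    for path in paths.values():
        # exist_ok=True makes repeated runs safe.
        path.mkdir(parents=True, exist_ok=True)
    return paths


if __name__ == "__main__":
    import tempfile

    # Demonstrate in a throwaway directory so nothing is created in the project.
    with tempfile.TemporaryDirectory() as tmp:
        print(sorted(ensure_work_dirs(Path(tmp))))
```
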
Best working command on Apple Silicon (no images, accurate tables, automatic language detection, Obsidian-friendly Markdown written to processed/):
python pdf_to_markdown.py \
--engine docling \
--device mps \
--image-export-mode placeholder \
--table-extraction-mode accurate \
--fix-markdown-lint \
--inbox inbox \
--processed processed \
--failed failed \
--logs logs

- engine: switches to Docling for robust PDF→Markdown.
- device mps: Apple Silicon GPU acceleration path (Docling handles this internally).
- image-export-mode placeholder: no image files exported (smallest Markdown output). Use `referenced` to keep images next to the Markdown.
- table-extraction-mode accurate: higher quality table extraction.
- fix-markdown-lint: applies post-processing fixes (removes trailing punctuation from headings, adds spaces after #, ensures proper list spacing, single trailing newline).
- YAML frontmatter is prepended to match Obsidian usage (`title`, `source`, `detected_language`, `page_count`, `processed_at`).

Docling does not require Poppler. Poppler is only needed for the EasyOCR path.
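
The four `--fix-markdown-lint` fixes listed above could look roughly like this. It is a sketch under stated assumptions, not the tool's own code: the function name, the regular expressions, and the reading of "proper list spacing" as "a blank line before the first list item" are all guesses. It also does not skip fenced code blocks, so lines starting with `#` inside code would be touched too.

```python
import re

LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.) ")


def fix_markdown_lint(text: str) -> str:
    """Apply heading, list-spacing, and trailing-newline fixes (hypothetical sketch)."""
    out: list[str] = []
    for line in text.splitlines():
        # Add the missing space after the leading '#' run: "##Title" -> "## Title".
        line = re.sub(r"^(#{1,6})(?=[^#\s])", r"\1 ", line)
        # Remove trailing punctuation from headings: "## Title:" -> "## Title".
        if re.match(r"^#{1,6} ", line):
            line = line.rstrip().rstrip(".,;:!")
        prev = out[-1] if out else ""
        # Assumed meaning of "proper list spacing": blank line before a list starts.
        if LIST_ITEM.match(line) and prev.strip() and not LIST_ITEM.match(prev):
            out.append("")
        out.append(line)
    # End the document with exactly one newline.
    return "\n".join(out).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(fix_markdown_lint("##Title:\ntext\n- a\n- b\n\n\n"), end="")
```
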
python pdf_to_markdown.py \
--engine easyocr \
--fix-markdown-lint \
--inbox inbox \
--processed processed \
--failed failed \
--dpi 300 \
--workers 4 \
--gpu-preference auto \
--resume \
--log logs/run.log \
--language-detection-pages 1 \
--image-only-action skip \
--min-text-chars 50 \
--min-boxes 3 \
--min-conf 0.35

- `--fix-markdown-lint`: applies post-processing fixes to improve Markdown quality.
- `--resume`: appends missing pages if an output `.md` already exists in `processed/` (EasyOCR only).
- `--gpu-preference auto`: uses CUDA if available, otherwise CPU. Apple Silicon MPS is not used by EasyOCR; it will fall back to CPU.
- `--image-only-action {skip,note}`: on low-text PDFs, the default `skip` moves the PDF to `failed/` and does not create Markdown.
- Thresholds used during the probe to decide image-only: `--min-text-chars`, `--min-boxes`, `--min-conf`.

Each output file in `processed/` has the same basename as the PDF, with YAML frontmatter and per-page sections. The `.md` sits next to the `.pdf`, so they pair by name and sort together. For example:
---
title: SampleDoc
source: [SampleDoc](SampleDoc.pdf)
detected_language: en
page_count: 3
document_date: 2024-03-15
---
## Page 1
...text...
## Page 2
...text...

The `source` field links to the sibling PDF (same directory, relative filename only).
- A 1-page probe runs OCR in both `en` and `hu`.
- Scoring combines: average OCR confidence, accented-letter ratio (weight 0.2), and stopword hit rate (weight 0.3).
- The higher score determines `detected_language` for the full run.
- Observed accuracy: correct for our sample set except one mixed-language document (it started with some English but was mostly Hungarian). Mixed pages can bias the probe; if needed, split such PDFs by language and rerun them separately.

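
The scoring described above can be sketched as follows. Only the three ingredients and the weights 0.2 and 0.3 come from this README. Everything else is an assumption made for illustration: that the terms are simply added, that the accent ratio counts toward `hu` only, the tiny stopword lists, and the function names.

```python
import re

# Tiny illustrative stopword lists; the lists the tool actually uses are not given here.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "hu": {"a", "az", "és", "hogy", "nem", "is", "egy", "van"},
}
HU_ACCENTS = set("áéíóöőúüűÁÉÍÓÖŐÚÜŰ")


def language_score(lang: str, text: str, avg_conf: float) -> float:
    """Combine OCR confidence, accent ratio (0.2), and stopword rate (0.3)."""
    letters = [c for c in text if c.isalpha()]
    accent_ratio = sum(c in HU_ACCENTS for c in letters) / len(letters) if letters else 0.0
    words = re.findall(r"\w+", text.lower())
    stop_rate = sum(w in STOPWORDS[lang] for w in words) / len(words) if words else 0.0
    # Assumption: accented letters are evidence for Hungarian only.
    accent_term = accent_ratio if lang == "hu" else 0.0
    return avg_conf + 0.2 * accent_term + 0.3 * stop_rate


def detect_language(probe: dict[str, tuple[str, float]]) -> str:
    """probe maps a language code to (ocr_text, avg_conf) from the 1-page probe."""
    return max(probe, key=lambda lang: language_score(lang, *probe[lang]))


if __name__ == "__main__":
    page = "A szerződés nem érvényes, és az ügyfél egy új példányt kér."
    print(detect_language({"en": (page, 0.60), "hu": (page, 0.60)}))
```
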
- Considered image-only if, during the probe, either:
  - max(chars) < `--min-text-chars` AND max(boxes) < `--min-boxes`, or
  - max(avg_conf) < `--min-conf` AND max(stopword_rate) < 0.005.
- Default action: `--image-only-action skip`.
  - Logs an `IMAGE_ONLY ...` line and moves the PDF to `failed/`.
  - No Markdown is produced.

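
The image-only rule above translates into a short predicate. This is a sketch, assuming that `max(...)` is taken across the per-language probe results and using the threshold values from the example EasyOCR command as defaults; `ProbeResult` and `is_image_only` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Per-language probe measurements (hypothetical container)."""
    chars: int
    boxes: int
    avg_conf: float
    stopword_rate: float


def is_image_only(results: list[ProbeResult], min_text_chars: int = 50,
                  min_boxes: int = 3, min_conf: float = 0.35,
                  min_stopword_rate: float = 0.005) -> bool:
    """Return True if either of the two image-only conditions holds."""
    # Condition 1: almost no text and almost no detected boxes.
    too_little_text = (max(r.chars for r in results) < min_text_chars
                       and max(r.boxes for r in results) < min_boxes)
    # Condition 2: low OCR confidence and essentially no stopword hits.
    unreadable = (max(r.avg_conf for r in results) < min_conf
                  and max(r.stopword_rate for r in results) < min_stopword_rate)
    return too_little_text or unreadable


if __name__ == "__main__":
    scan = [ProbeResult(12, 1, 0.20, 0.0), ProbeResult(9, 2, 0.25, 0.0)]
    print(is_image_only(scan))
```
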
- If Poppler is not installed or not in PATH, you will see an error about `pdftoppm` missing.
- Logs are appended to `logs/run.log`. A per-file state is kept in `logs/state.json` for resume support.

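
To check for the Poppler problem before starting an EasyOCR run, a quick preflight along these lines may help. It only looks for the `pdftoppm` executable on PATH; the helper name is hypothetical and the check is not part of the tool as described here.

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm executable can be found on PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("pdftoppm found:", shutil.which("pdftoppm"))
    else:
        print("pdftoppm not found; run 'brew install poppler' or use --engine docling")
```
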
Place a few sample PDFs in inbox/ and run the command above. Confirm:
- Language detection chooses `en` vs `hu` appropriately.
- Markdown is created in `processed/` (paired `.md` next to `.pdf`) with correct frontmatter and content sections.
- Original PDFs are moved to `processed/` after success.
- Re-running with `--resume` appends missing pages when applicable (EasyOCR only).
- Image-only PDFs are skipped (no `.md`), moved to `failed/`, and `IMAGE_ONLY` entries appear in `logs/run.log` alongside `PROBE`, `PAGE`, and `FINISH` lines.
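
Two of the checks above (paired files in `processed/`, no Markdown in `failed/`) are easy to script. The sketch below uses hypothetical helper names and assumes the default folder names and lowercase `.pdf`/`.md` extensions.

```python
from pathlib import Path


def unpaired_pdfs(processed: Path) -> list[str]:
    """Names of PDFs in processed/ that lack a sibling .md with the same basename."""
    return sorted(p.name for p in processed.glob("*.pdf")
                  if not p.with_suffix(".md").exists())


def stray_markdown(failed: Path) -> list[str]:
    """Names of Markdown files in failed/, where none are expected."""
    return sorted(p.name for p in failed.glob("*.md"))


if __name__ == "__main__":
    print("PDFs without Markdown:", unpaired_pdfs(Path("processed")))
    print("Markdown in failed/:", stray_markdown(Path("failed")))
```

Both lists should be empty after a successful run.
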