
feat: render Formula elements as $$ blocks with optional normalization #4308

Merged
PastelStorm merged 11 commits into Unstructured-IO:main from claytonlin1110:feat/formula-markdown-latex on Apr 3, 2026

Conversation

@claytonlin1110
Contributor

Summary

  • Render Formula elements as Markdown math blocks ($$ ... $$) during markdown staging export.
  • Add optional normalize_formula flag (default True) to control conservative Unicode-math to LaTeX-like normalization.
  • Add tests for:
    • formula block rendering
    • normalization on/off behavior
    • passthrough of normalize_formula via create_file_from_elements()
  • Update changelog and bump package version to 0.22.11.

Closes #3868

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from 17e10f5 to 199318a on March 31, 2026 19:58
Contributor

@PastelStorm PastelStorm left a comment


Review

Severity: High

Description: elements_to_md() silently breaks backward compatibility for positional callers by inserting normalize_formula before encoding.

# New signature (this PR):
def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    encoding: str = "utf-8",
) -> str:

# Previous signature:
def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    encoding: str = "utf-8",
) -> str:

Why it is risky: previously valid calls like elements_to_md(elements, path, False, "latin-1") now bind "latin-1" to normalize_formula instead of encoding. The call succeeds, so the regression is silent, but the file is written as UTF-8 instead of the requested encoding.

Concrete failure scenario: a document-export worker writes markdown for a downstream Latin-1 consumer. After this PR, retries all succeed but produce incorrectly encoded output, causing corrupted artifacts or decode failures downstream.
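One way to eliminate this class of break, sketched below, is to make the new flag keyword-only so it can never capture a positional argument. The signature here is a hypothetical stand-in, not the project's actual code:

```python
from typing import Iterable, Optional

def elements_to_md_sketch(
    elements: Iterable,
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    encoding: str = "utf-8",
    *,  # everything after this bare star must be passed by keyword
    normalize_formula: bool = True,
) -> str:
    # Placeholder body: return the bound values so the binding is observable.
    return f"encoding={encoding} normalize_formula={normalize_formula}"

# A legacy positional call still binds "latin-1" to `encoding`, not the new flag.
legacy = elements_to_md_sketch([], "out.md", False, "latin-1")
```

With this shape, an old four-positional-argument call keeps its meaning, and any attempt to pass a fifth positional argument fails loudly with TypeError instead of silently rebinding.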

Severity: High

Description: create_file_from_elements() has the same silent positional-argument break by inserting normalize_formula before no_group_by_page.

# New signature (this PR):
def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    no_group_by_page: bool = True,
) -> str:

# Previous signature:
def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    no_group_by_page: bool = True,
) -> str:

Why it is risky: older positional callers now pass their intended no_group_by_page value into normalize_formula, and no_group_by_page silently reverts to its default. That changes output semantics with no exception.

Concrete failure scenario: an HTML export path calls create_file_from_elements(elements, "html", out, "utf-8", False, False) expecting grouped-by-page behavior. After this change, the last False disables formula normalization instead, while HTML grouping silently changes, so production exports return structurally different HTML under load and retries reproduce the same bad output.

Severity: High

Description: Mapping √ to \sqrt{} is not conservative normalization; it corrupts the math expression unless the radicand is explicitly reparsed and wrapped.

def _normalize_formula_for_markdown(text: str) -> str:
    ...
    substitutions = {
        "−": "-",  # Unicode minus -> ASCII hyphen-minus
        "×": r"\times",
        ...
        "√": r"\sqrt{}",
    }
    normalized = text
    for source, target in substitutions.items():
        normalized = normalized.replace(source, target)
    return normalized

Why it is risky: √x becomes \sqrt{}x, which changes the semantics of the formula rather than just normalizing syntax. Since normalize_formula=True is the default, this corrupting behavior is on the default path.

Concrete failure scenario: a formula extracted as √2 ≤ x is exported as $$\n\sqrt{}2 \leq x\n$$, and downstream math rendering shows an empty radical plus 2, or fails to render correctly. The stored/exported markdown is wrong on every retry.
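A radicand-aware substitution would avoid the empty radical. The sketch below assumes the radicand is either a parenthesized group or the single token immediately after the radical sign; it is illustrative, not the PR's implementation:

```python
import re

def normalize_sqrt(text: str) -> str:
    """Radicand-aware replacement for the bare '√' -> r'\\sqrt{}' mapping."""
    # √(x+1) -> \sqrt{x+1}: capture the parenthesized group as the radicand.
    text = re.sub(r"√\(([^)]*)\)", r"\\sqrt{\1}", text)
    # √2, √x -> \sqrt{2}, \sqrt{x}: capture the next alphanumeric token.
    text = re.sub(r"√\s*([A-Za-z0-9]+)", r"\\sqrt{\1}", text)
    return text
```

Under this sketch, √2 ≤ x becomes \sqrt{2} ≤ x rather than \sqrt{}2 ≤ x, and a bare trailing √ with no recoverable radicand is simply left untouched.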

What I Would Not Keep At Critical/High

outputmarkdowndiff.txt

I agree it should be removed, but I would not rate it Critical. It looks like an accidental artifact and repo hygiene issue, not a production correctness/concurrency problem. If this review is strictly limited to critical/high, I would leave it out or mention it separately as non-blocking cleanup.

Version bump / changelog edit

I would not keep this as High without explicit repo policy saying feature PRs must not touch versioning. Merge-conflict risk is real, but that is usually workflow friction rather than a high-severity product bug. In many repos, feature branches intentionally update changelogs and versions, so this is too process-dependent to call high-confidence/high-severity from the diff alone.

Missing High-Value Tests

The highest-value missing tests are:

  • A regression test for old positional elements_to_md(..., encoding) usage.
  • A regression test for old positional create_file_from_elements(..., no_group_by_page) usage.
  • A normalization test such as Formula("√2") and Formula("√(x+1)").
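The first of these could look roughly like the sketch below, which locks the positional slot of encoding. The local elements_to_md stub is a hypothetical stand-in for importing the real function from unstructured.staging.base:

```python
import inspect

# Hypothetical stand-in; a real regression test would import elements_to_md
# from unstructured.staging.base instead of defining it locally.
def elements_to_md(elements, filename=None, exclude_binary_image_data=False,
                   encoding="utf-8", *, normalize_formula=True):
    return encoding

def test_positional_encoding_still_binds():
    # Old callers passed encoding as the fourth positional argument; this
    # fails if a new parameter is ever inserted ahead of it.
    params = list(inspect.signature(elements_to_md).parameters)
    assert params[3] == "encoding"
    assert elements_to_md([], "out.md", False, "latin-1") == "latin-1"
```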

Verdict

Do Not Merge

The merged critical/high set is:

  • High: silent positional API break in elements_to_md()
  • High: silent positional API break in create_file_from_elements()
  • High: incorrect normalization corrupts formula output

No Critical findings with high confidence.

Contributor

@PastelStorm PastelStorm left a comment


Findings

Severity: High
Description: Every Formula is now serialized as a display-math block, even when the extracted payload is clearly not valid LaTeX/math syntax.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"
The corrosion rate (CR) was calculated using Eq. (1) [15]
$$
Corrosion rate CRð Þ ¼ 87:6W DAT ð1Þ
$$
where: W is weight loss in mg, A is specimen surface area, T is immersion period in hours and D is the specimen density. From the corrosion rate, the surface coverage (θ) and inhibition efficiencies (IE %) were determined using Eqs. (2) and (3) respectively
$$
_ CRo—CR O= OR
$$
$$
CRo=CR , 100 IE (0) = CR
$$

Why it is risky: This changes the abstraction from “emit extracted text” to “assert this text is safe display math.” Markdown consumers that run KaTeX/MathJax will now parse OCR-corrupted formula text as math source, which can fail to parse, render blank/error output, or distort surrounding content. The golden updates already show real extracted Formula payloads that are noisy OCR, not reliable math markup.

Concrete failure scenario: A production ingestion job exports markdown for PDFs, and a downstream service renders all $$...$$ blocks with KaTeX. A biomedical document containing OCR-heavy formulas like _ CRo—CR O= OR now produces math parse failures, so end users see missing equations or broken sections where the old code at least showed readable plain text.


Severity: High
Description: Raw Formula.text is inserted inside $$ fences with no escaping or sanitization of embedded math delimiters.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"

Why it is risky: If extracted formula text itself contains $$, or delimiter-like sequences introduced by OCR, the serializer can terminate the display-math block early and leak the remainder back into normal markdown parsing. That is a structural corruption bug, not just a cosmetic rendering issue.

Concrete failure scenario: One tenant uploads a PDF whose extracted Formula.text contains a $$ b or a similar OCR artifact. The generated markdown closes the math block at the inner delimiter, and the rest of the document is reinterpreted as normal markdown. In a batch/high-concurrency pipeline this yields hard-to-debug, document-specific corruption that only appears for certain source files and downstream renderers.
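One conservative policy, sketched here as an assumption rather than the PR's actual contract, is to fall back to plain text whenever the payload could close the fence:

```python
def wrap_display_math(text: str) -> str:
    """Refuse to emit a $$ block when the payload could terminate the fence.

    Policy assumption (not defined by the PR): unsafe payloads fall back to
    plain text, which cannot break the surrounding markdown structure.
    """
    stripped = text.strip()
    if "$$" in stripped:
        return stripped  # leak-proof fallback: emit as plain text
    return f"$$\n{stripped}\n$$"
```

Whatever policy is chosen (wrap anyway, fall back, or sanitize), it should be pinned by a test so the contract cannot drift silently.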

Missing High-Value Tests

  • A Formula containing embedded $$, lone $, or multiline delimiter-like OCR noise, to prove markdown structure cannot be broken by the new serializer.
  • An end-to-end markdown-rendering test using one of the newly updated noisy fixtures, to verify the new $$ output is actually consumable by a real math renderer or to document that invalid math should remain plain text instead.
  • A policy test for malformed/non-LaTeX Formula content: either “wrap anyway,” “fallback to plain text,” or “reject/sanitize.” Right now the branch changes behavior without locking down that contract.

Verdict

Do Not Merge

I did not find any critical/high-confidence concurrency or shared-state bugs in the diff itself; the blocker is correctness and downstream production risk from unconditionally treating extracted Formula text as safe display math.

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from ec3c4b9 to 1c7d90f on April 2, 2026 04:33
@PastelStorm
Contributor

@claytonlin1110 looks like it needs a bit of a cleanup and is missing a few tests. After that we should be good.

Findings

No remaining Critical or High issues found on the current head of PR #4308.

Missing High-Value Tests

A few worthwhile tests are still implied by the new API surface, even though I would not block merge on them at high severity:

  • An end-to-end JSON fixture test with real Formula elements through elements_from_json() -> elements_to_md(), not just synthetic Formula(...) units.
  • A regression test that create_file_from_elements(..., formula_markdown_style=...) actually propagates the style option through markdown output.
  • A full positional-signature regression for elements_to_md(elements, filename, exclude_binary_image_data, encoding, normalize_formula) to lock the new parameter order.

Verdict

Merge

One non-blocking note: the branch still contains outputmarkdowndiff.txt, which looks like a committed scratch artifact rather than product code. I’m not counting that as a critical/high code issue, but I would remove it before landing for branch hygiene.

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from a60c8c1 to b36ef86 on April 3, 2026 17:13
@claytonlin1110
Contributor Author

@PastelStorm Updated. Please review.

@PastelStorm
Contributor

@claytonlin1110 sorry for the drip :|


Code Review: claytonlin1110:feat/formula-markdown-latex

This branch bundles four distinct features: (1) Formula element rendering as $$ display-math blocks in Markdown, (2) embedded CMap stream parsing for CIDFonts in pdfminer, (3) PDF rendering delegation to unstructured-inference, and (4) standardize_quotes refactor via str.translate. Below are findings of Critical and High severity only.


Finding 1 — HIGH: LaTeX command substitutions fuse with adjacent characters, producing invalid commands

File: unstructured/staging/base.py, _normalize_formula_for_markdown

The function performs bare str.replace() for Unicode→LaTeX mappings (e.g. "∈" → r"\in", "≤" → r"\leq"). When the source text has no whitespace around the symbol (common in OCR output), the LaTeX command name fuses with the adjacent character, creating an undefined command.

Why it's risky: LaTeX/KaTeX/MathJax tokenize \ + consecutive ASCII letters as a single command name. \inS is parsed as the undefined command \inS, not \in followed by S.

Failure scenario:

Input:  Formula("x∈S")
After:  "x\inS"  →  KaTeX error: \inS is not recognized

Same for a≤b → a\leqb, a≠b → a\neqb, etc. This affects every non-whitespace-delimited math symbol — a very common case in OCR-extracted formulas.

Fix: Append {} after each LaTeX command to terminate the name: "∈" → r"\in{}".
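That fix can be sketched as follows; the substitution values are an illustrative subset, not the PR's full table:

```python
# {} terminates the LaTeX command name, so it cannot fuse with a following
# letter: KaTeX parses \in{}S as \in, an empty group, then S.
SUBSTITUTIONS = {
    "∈": r"\in{}",
    "≤": r"\leq{}",
    "≠": r"\neq{}",
}

def normalize(text: str) -> str:
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(source, target)
    return text
```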


Finding 2 — HIGH: Normalization runs before the auto-detection heuristic, changing wrapping decisions

File: unstructured/staging/base.py, _emit_formula_markdown

Normalization (_normalize_formula_for_markdown) runs before _formula_auto_use_display_math. The heuristic scores Unicode symbols ([∈∉≤≥≠≈×÷∞∑∫√∂∇]) at +2 each, but after normalization those codepoints are gone and replaced by \leq etc., which trigger the different \\[a-zA-Z]+ check (flat +3). This shifts scores at boundary conditions.

Failure scenario:

Text (85 chars, contains "where"):
  "E ≤ threshold where E is the energy and threshold was determined experimentally"

Raw score: ≤ gives +2, total=2. Prose threshold=3. Score 2<3 → plain text. ✓
After normalization: \leq triggers \\[a-zA-Z]+ → +3, total=3. Score 3≥3 → wrapped in $$. ✗

User sees a prose sentence incorrectly wrapped as display math.

Fix: Run auto-detection on the raw text, then normalize only the text that will be wrapped.
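The reordering can be sketched with a toy version of the heuristic; the weights and threshold mirror the numbers quoted above, but the real scoring function may differ:

```python
import re

UNICODE_MATH = set("∈∉≤≥≠≈×÷∞∑∫√∂∇")

def score(text: str) -> int:
    """Toy heuristic: +2 per Unicode math symbol, +3 for any \\command."""
    s = 2 * sum(1 for ch in text if ch in UNICODE_MATH)
    if re.search(r"\\[a-zA-Z]+", text):
        s += 3
    return s

def emit(text: str, normalize) -> str:
    # Decide wrapping on the RAW text; normalize only what gets wrapped.
    if score(text) >= 3:
        return f"$$\n{normalize(text)}\n$$"
    return text
```

Scoring the raw text means a prose sentence containing a single ≤ stays at score 2 and is emitted as plain text, regardless of what normalization would later turn the symbol into.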


Finding 3 — HIGH: Normalization alters text even when formula_markdown_style="plain"

File: unstructured/staging/base.py, _emit_formula_markdown

When style == "plain", the function returns text — but normalization has already mutated it. A caller using plain style reasonably expects unmodified text.

if normalize_formula:
    text = _normalize_formula_for_markdown(text)  # already mutated
# ...
if style == FORMULA_MARKDOWN_PLAIN:
    return text  # returns normalized, not raw

Failure scenario: element_to_md(Formula("a − b"), formula_markdown_style="plain") returns "a - b" (Unicode minus silently replaced with ASCII hyphen) instead of the original "a − b".

Fix: Check style before normalizing, or skip normalization when style == "plain".
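A minimal sketch of that ordering fix, where the constant name and the substitution are assumptions for illustration:

```python
FORMULA_MARKDOWN_PLAIN = "plain"  # assumed constant name from the review

def emit_formula(text: str, style: str, normalize_formula: bool = True) -> str:
    # Take the plain-style early exit BEFORE any normalization, so plain
    # output is byte-identical to the input text.
    if style == FORMULA_MARKDOWN_PLAIN:
        return text
    if normalize_formula:
        text = text.replace("−", "-")  # illustrative subset of the table
    return f"$$\n{text}\n$$"
```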


Finding 4 — HIGH: Bare except Exception swallows critical errors silently

File: unstructured/partition/pdf_image/pdfminer_utils.py, CustomPDFCIDFont.get_cmap_from_spec

except Exception:
    logger.debug("Failed to parse embedded CMap stream %r", cmap_name)

This catches MemoryError, RecursionError, and any bug in the new parsing code, logging only at debug level without a traceback. In production (log level INFO+), this is completely invisible.

Failure scenario: A CMap contains odd-length hex like <F>. bytes.fromhex("F") raises ValueError. The exception is swallowed, the font falls back to an empty CMap, and all text for that font is silently lost — with zero production log visibility.

Fix: Narrow to specific exceptions (ValueError, KeyError, zlib.error). At minimum add exc_info=True and log at warning.
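A sketch of the narrowed handler; parse and fallback are hypothetical stand-ins for the real code paths:

```python
import logging
import zlib

logger = logging.getLogger(__name__)

def parse_cmap_safely(parse, cmap_name, fallback):
    """Catch only the decode/parse errors the CMap code can plausibly raise,
    and keep the traceback visible at warning level."""
    try:
        return parse()
    except (ValueError, KeyError, zlib.error):
        logger.warning(
            "Failed to parse embedded CMap stream %r", cmap_name, exc_info=True
        )
        return fallback
```

MemoryError, RecursionError, and genuine bugs in the parsing code now propagate instead of silently degrading text extraction.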


Finding 5 — HIGH: Partial CMap returned when mapping budget is exceeded

File: unstructured/partition/pdf_image/pdfminer_utils.py, _parse_embedded_cmap_stream

When a cidrange exceeds the remaining mapping budget, the code uses continue (skip this range, keep processing subsequent ranges). This can return a CMap with arbitrary gaps — some code points decode correctly, others are unmapped.

Failure scenario: A CJK CMap has ranges R1 (5K entries), R2 (130K entries), R3 (1K entries). R1 is accepted, R2 is skipped (exceeds 131K cap), R3 is accepted. The CMap is returned with R1+R3 but missing R2. Since code2cid is truthy, it's trusted. Most CJK characters (in R2) decode to wrong CIDs → garbled text output.

Fix: Return an empty CMap() once the budget is exceeded, rather than a partial map with holes. A partial CMap is worse than no CMap because the caller trusts non-empty code2cid.
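The all-or-nothing policy can be sketched with a plain dict standing in for pdfminer's CMap; the (start, end, start_cid) tuple shape is an assumption:

```python
def build_cid_map(ranges, budget):
    """Build a code->CID map, or return an empty map if any cidrange would
    exceed the budget. A partial map with holes decodes some codepoints to
    wrong CIDs, which is worse than no map at all."""
    code2cid = {}
    for start, end, start_cid in ranges:
        size = end - start + 1
        if len(code2cid) + size > budget:
            return {}  # bail out entirely rather than skip this range
        for offset in range(size):
            code2cid[start + offset] = start_cid + offset
    return code2cid
```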


Finding 6 — HIGH: exactly_one(filename, file) validation dropped

File: unstructured/partition/pdf_image/pdf_image_utils.py, convert_pdf_to_image

The old _render_pdf_pages called exactly_one(filename=filename, file=file) to reject calls where both arguments are provided. The replacement render_pdf_to_image from unstructured-inference only checks if both are None — it does not reject both-truthy.

Failure scenario: A caller passes both filename="old.pdf" and file=<in-memory bytes of new.pdf>. The old code raises immediately. The new code silently uses filename (via filename or file), ignoring the file bytes entirely. The caller gets page images from the wrong PDF with no error.

Fix: Add exactly_one(filename=filename, file=file) inside convert_pdf_to_image before delegating.
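A sketch of the guard; exactly_one_sketch is a toy version of the contract the old code relied on, not the real unstructured.utils helper:

```python
def exactly_one_sketch(**kwargs):
    """Require that exactly one keyword argument is non-None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            f"Exactly one of {sorted(kwargs)} must be given, got {provided}"
        )

def convert_pdf_to_image_sketch(filename=None, file=None):
    # Reject ambiguous calls up front instead of silently preferring filename.
    exactly_one_sketch(filename=filename, file=file)
    return filename if filename is not None else "<from-bytes>"
```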


Finding 7 — HIGH: CustomPDFResourceManager.get_font duplicates parent cache/dispatch logic

File: unstructured/partition/pdf_image/pdfminer_utils.py

The override re-implements the cache lookup, strict-mode check, and subtype extraction from PDFResourceManager.get_font. If pdfminer's internal get_font logic changes (e.g., adding pre-processing, new font subtypes, or changed caching), the override diverges silently.

Risk: A pdfminer version bump adds a pre-processing step (e.g., resolving indirect references) that the override bypasses for CIDFont types. CIDFont construction receives un-resolved spec entries.


Missing High-Value Test Cases

  1. LaTeX command boundary: Test Formula("x∈S") — expected $$\nx \in{} S\n$$ or similar, currently produces broken \inS.
  2. Normalization vs. heuristic ordering: Test a prose-length formula with exactly one Unicode math symbol + a prose keyword, verifying it's NOT wrapped.
  3. CMap budget exhaustion with multiple ranges: Test that a CMap exceeding the mapping cap returns an empty CMap (not a partial one).
  4. Both filename and file provided to convert_pdf_to_image: Should raise ValueError.
  5. Encrypted CMap stream: Test that decipher path in _decode_pdfstream_with_limit handles missing objid/genno.

Positive Notes (not in scope, but worth acknowledging)

  • The standardize_quotes refactor is strictly correct — fixes two silent data-loss bugs (U+201C, U+2018 were dropped by duplicate dict keys) and the ",," entry was dead code in the old implementation. No issues found.
  • The thread-safety concern about removing _pdfium_lock is not a regression: render_pdf_to_image in unstructured-inference has its own lock, and consolidating to one lock is actually an improvement.
  • The embedded CMap parsing logic is well-bounded with size caps and mapping limits (modulo Finding 5 above).
  • The test coverage for the new CMap parsing is thorough.

Verdict: Do Not Merge

Findings 1–3 are correctness bugs in the core Formula→Markdown feature (the primary purpose of this PR). The LaTeX command boundary issue (Finding 1) will produce broken output for common OCR inputs. The normalization-before-heuristic bug (Finding 2) causes incorrect wrapping decisions at boundary conditions. These should be fixed and re-verified before merging. The remaining findings (4–7) in the pdfminer and pdf_image_utils areas add production risk but are individually addressable.

@PastelStorm PastelStorm enabled auto-merge April 3, 2026 20:11
@PastelStorm PastelStorm added this pull request to the merge queue Apr 3, 2026
Merged via the queue into Unstructured-IO:main with commit 264d569 Apr 3, 2026
53 checks passed
