
feat: render Formula elements as $$ blocks with optional normalization #4308

Merged
PastelStorm merged 11 commits into Unstructured-IO:main from claytonlin1110:feat/formula-markdown-latex on Apr 3, 2026

Conversation

@claytonlin1110
Contributor

Summary

  • Render Formula elements as Markdown math blocks ($$ ... $$) during markdown staging export.
  • Add optional normalize_formula flag (default True) to control conservative Unicode-math to LaTeX-like normalization.
  • Add tests for:
    • formula block rendering
    • normalization on/off behavior
    • passthrough of normalize_formula via create_file_from_elements()
  • Update changelog and bump package version to 0.22.11.

Closes #3868

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from 17e10f5 to 199318a on March 31, 2026 19:58
Contributor

@PastelStorm PastelStorm left a comment


Review

Severity: High

Description: elements_to_md() silently breaks backward compatibility for positional callers by inserting normalize_formula before encoding.

# New signature (this PR):
def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    encoding: str = "utf-8",
) -> str:

# Previous signature:
def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    encoding: str = "utf-8",
) -> str:

Why it is risky: previously valid calls like elements_to_md(elements, path, False, "latin-1") now bind "latin-1" to normalize_formula instead of encoding. The call succeeds, so the regression is silent, but the file is written as UTF-8 instead of the requested encoding.

Concrete failure scenario: a document-export worker writes markdown for a downstream Latin-1 consumer. After this PR, retries all succeed but produce incorrectly encoded output, causing corrupted artifacts or decode failures downstream.
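One way to eliminate this class of break, sketched below, is to make the new flag keyword-only so it can never capture a positional argument. The signature here is a hypothetical stand-in, not the project's actual code:

```python
from typing import Iterable, Optional

def elements_to_md_sketch(
    elements: Iterable,
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    encoding: str = "utf-8",
    *,  # everything after this bare star must be passed by keyword
    normalize_formula: bool = True,
) -> str:
    # Placeholder body: return the bound values so the binding is observable.
    return f"encoding={encoding} normalize_formula={normalize_formula}"

# A legacy positional call still binds "latin-1" to `encoding`, not the new flag.
legacy = elements_to_md_sketch([], "out.md", False, "latin-1")
```

With this shape, an old four-positional-argument call keeps its meaning, and any attempt to pass a fifth positional argument fails loudly with TypeError instead of silently rebinding.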

Severity: High

Description: create_file_from_elements() has the same silent positional-argument break by inserting normalize_formula before no_group_by_page.

# New signature (this PR):
def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    no_group_by_page: bool = True,
) -> str:

# Previous signature:
def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    no_group_by_page: bool = True,
) -> str:

Why it is risky: older positional callers now pass their intended no_group_by_page value into normalize_formula, and no_group_by_page silently reverts to its default. That changes output semantics with no exception.

Concrete failure scenario: an HTML export path calls create_file_from_elements(elements, "html", out, "utf-8", False, False) expecting grouped-by-page behavior. After this change, the last False disables formula normalization instead, while HTML grouping silently changes, so production exports return structurally different HTML under load and retries reproduce the same bad output.

Severity: High

Description: Mapping √ to \sqrt{} is not conservative normalization; it corrupts the math expression unless the radicand is explicitly reparsed and wrapped.

def _normalize_formula_for_markdown(text: str) -> str:
    ...
    substitutions = {
        "−": "-",  # Unicode minus -> ASCII hyphen-minus
        "×": r"\times",
        ...
        "√": r"\sqrt{}",
    }
    normalized = text
    for source, target in substitutions.items():
        normalized = normalized.replace(source, target)
    return normalized

Why it is risky: √x becomes \sqrt{}x, which changes the semantics of the formula rather than just normalizing syntax. Since normalize_formula=True is the default, this corrupting behavior is on the default path.

Concrete failure scenario: a formula extracted as √2 ≤ x is exported as $$\n\sqrt{}2 \leq x\n$$, and downstream math rendering shows an empty radical plus 2, or fails to render correctly. The stored/exported markdown is wrong on every retry.
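A radicand-aware substitution would avoid the empty radical. The sketch below assumes the radicand is either a parenthesized group or the single token immediately after the radical sign; it is illustrative, not the PR's implementation:

```python
import re

def normalize_sqrt(text: str) -> str:
    """Radicand-aware replacement for the bare '√' -> r'\\sqrt{}' mapping."""
    # √(x+1) -> \sqrt{x+1}: capture the parenthesized group as the radicand.
    text = re.sub(r"√\(([^)]*)\)", r"\\sqrt{\1}", text)
    # √2, √x -> \sqrt{2}, \sqrt{x}: capture the next alphanumeric token.
    text = re.sub(r"√\s*([A-Za-z0-9]+)", r"\\sqrt{\1}", text)
    return text
```

Under this sketch, √2 ≤ x becomes \sqrt{2} ≤ x rather than \sqrt{}2 ≤ x, and a bare trailing √ with no recoverable radicand is simply left untouched.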

What I Would Not Keep At Critical/High

outputmarkdowndiff.txt

I agree it should be removed, but I would not rate it Critical. It looks like an accidental artifact and repo hygiene issue, not a production correctness/concurrency problem. If this review is strictly limited to critical/high, I would leave it out or mention it separately as non-blocking cleanup.

Version bump / changelog edit

I would not keep this as High without explicit repo policy saying feature PRs must not touch versioning. Merge-conflict risk is real, but that is usually workflow friction rather than a high-severity product bug. In many repos, feature branches intentionally update changelogs and versions, so this is too process-dependent to call high-confidence/high-severity from the diff alone.

Missing High-Value Tests

The highest-value missing tests are:

  • A regression test for old positional elements_to_md(..., encoding) usage.
  • A regression test for old positional create_file_from_elements(..., no_group_by_page) usage.
  • A normalization test such as Formula("√2") and Formula("√(x+1)").
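The first of these could look roughly like the sketch below, which locks the positional slot of encoding. The local elements_to_md stub is a hypothetical stand-in for importing the real function from unstructured.staging.base:

```python
import inspect

# Hypothetical stand-in; a real regression test would import elements_to_md
# from unstructured.staging.base instead of defining it locally.
def elements_to_md(elements, filename=None, exclude_binary_image_data=False,
                   encoding="utf-8", *, normalize_formula=True):
    return encoding

def test_positional_encoding_still_binds():
    # Old callers passed encoding as the fourth positional argument; this
    # fails if a new parameter is ever inserted ahead of it.
    params = list(inspect.signature(elements_to_md).parameters)
    assert params[3] == "encoding"
    assert elements_to_md([], "out.md", False, "latin-1") == "latin-1"
```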

Verdict

Do Not Merge

The merged critical/high set is:

  • High: silent positional API break in elements_to_md()
  • High: silent positional API break in create_file_from_elements()
  • High: incorrect normalization corrupts formula output

No Critical findings with high confidence.

Contributor

@PastelStorm PastelStorm left a comment


Findings

Severity: High
Description: Every Formula is now serialized as a display-math block, even when the extracted payload is clearly not valid LaTeX/math syntax.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"
The corrosion rate (CR) was calculated using Eq. (1) [15]
$$
Corrosion rate CRð Þ ¼ 87:6W DAT ð1Þ
$$
where: W is weight loss in mg, A is specimen surface area, T is immersion period in hours and D is the specimen density. From the corrosion rate, the surface coverage (θ) and inhibition efficiencies (IE %) were determined using Eqs. (2) and (3) respectively
$$
_ CRo—CR O= OR
$$
$$
CRo=CR , 100 IE (0) = CR
$$

Why it is risky: This changes the abstraction from “emit extracted text” to “assert this text is safe display math.” Markdown consumers that run KaTeX/MathJax will now parse OCR-corrupted formula text as math source, which can fail to parse, render blank/error output, or distort surrounding content. The golden updates already show real extracted Formula payloads that are noisy OCR, not reliable math markup.

Concrete failure scenario: A production ingestion job exports markdown for PDFs, and a downstream service renders all $$...$$ blocks with KaTeX. A biomedical document containing OCR-heavy formulas like _ CRo—CR O= OR now produces math parse failures, so end users see missing equations or broken sections where the old code at least showed readable plain text.


Severity: High
Description: Raw Formula.text is inserted inside $$ fences with no escaping or sanitization of embedded math delimiters.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"

Why it is risky: If extracted formula text itself contains $$, or delimiter-like sequences introduced by OCR, the serializer can terminate the display-math block early and leak the remainder back into normal markdown parsing. That is a structural corruption bug, not just a cosmetic rendering issue.

Concrete failure scenario: One tenant uploads a PDF whose extracted Formula.text contains a $$ b or a similar OCR artifact. The generated markdown closes the math block at the inner delimiter, and the rest of the document is reinterpreted as normal markdown. In a batch/high-concurrency pipeline this yields hard-to-debug, document-specific corruption that only appears for certain source files and downstream renderers.
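One conservative policy, sketched here as an assumption rather than the PR's actual contract, is to fall back to plain text whenever the payload could close the fence:

```python
def wrap_display_math(text: str) -> str:
    """Refuse to emit a $$ block when the payload could terminate the fence.

    Policy assumption (not defined by the PR): unsafe payloads fall back to
    plain text, which cannot break the surrounding markdown structure.
    """
    stripped = text.strip()
    if "$$" in stripped:
        return stripped  # leak-proof fallback: emit as plain text
    return f"$$\n{stripped}\n$$"
```

Whatever policy is chosen (wrap anyway, fall back, or sanitize), it should be pinned by a test so the contract cannot drift silently.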

Missing High-Value Tests

  • A Formula containing embedded $$, lone $, or multiline delimiter-like OCR noise, to prove markdown structure cannot be broken by the new serializer.
  • An end-to-end markdown-rendering test using one of the newly updated noisy fixtures, to verify the new $$ output is actually consumable by a real math renderer or to document that invalid math should remain plain text instead.
  • A policy test for malformed/non-LaTeX Formula content: either “wrap anyway,” “fallback to plain text,” or “reject/sanitize.” Right now the branch changes behavior without locking down that contract.

Verdict

Do Not Merge

I did not find any critical/high-confidence concurrency or shared-state bugs in the diff itself; the blocker is correctness and downstream production risk from unconditionally treating extracted Formula text as safe display math.

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from ec3c4b9 to 1c7d90f on April 2, 2026 04:33
@PastelStorm
Contributor

@claytonlin1110 looks like it needs a bit of a cleanup and is missing a few tests. After that we should be good.

Findings

No remaining Critical or High issues found on the current head of PR #4308.

Missing High-Value Tests

A few worthwhile tests are still implied by the new API surface, even though I would not block merge on them at high severity:

  • An end-to-end JSON fixture test with real Formula elements through elements_from_json() -> elements_to_md(), not just synthetic Formula(...) units.
  • A regression test that create_file_from_elements(..., formula_markdown_style=...) actually propagates the style option through markdown output.
  • A full positional-signature regression for elements_to_md(elements, filename, exclude_binary_image_data, encoding, normalize_formula) to lock the new parameter order.

Verdict

Merge

One non-blocking note: the branch still contains outputmarkdowndiff.txt, which looks like a committed scratch artifact rather than product code. I’m not counting that as a critical/high code issue, but I would remove it before landing for branch hygiene.

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from a60c8c1 to b36ef86 on April 3, 2026 17:13
@claytonlin1110
Contributor Author

@PastelStorm Updated. Please review.

@PastelStorm
Contributor

@claytonlin1110 sorry for the drip :|


Code Review: claytonlin1110:feat/formula-markdown-latex

This branch bundles four distinct features: (1) Formula element rendering as $$ display-math blocks in Markdown, (2) embedded CMap stream parsing for CIDFonts in pdfminer, (3) PDF rendering delegation to unstructured-inference, and (4) standardize_quotes refactor via str.translate. Below are findings of Critical and High severity only.


Finding 1 — HIGH: LaTeX command substitutions fuse with adjacent characters, producing invalid commands

File: unstructured/staging/base.py, _normalize_formula_for_markdown

The function performs bare str.replace() for Unicode→LaTeX mappings (e.g. "∈" → r"\in", "≤" → r"\leq"). When the source text has no whitespace around the symbol (common in OCR output), the LaTeX command name fuses with the adjacent character, creating an undefined command.

Why it's risky: LaTeX/KaTeX/MathJax tokenize \ + consecutive ASCII letters as a single command name. \inS is parsed as the undefined command \inS, not \in followed by S.

Failure scenario:

Input:  Formula("x∈S")
After:  "x\inS"  →  KaTeX error: \inS is not recognized

Same for a≤b → a\leqb, a≠b → a\neqb, etc. This affects every non-whitespace-delimited math symbol — a very common case in OCR-extracted formulas.

Fix: Append {} after each LaTeX command to terminate the name: "∈" → r"\in{}".
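That fix can be sketched as follows; the substitution values are an illustrative subset, not the PR's full table:

```python
# {} terminates the LaTeX command name, so it cannot fuse with a following
# letter: KaTeX parses \in{}S as \in, an empty group, then S.
SUBSTITUTIONS = {
    "∈": r"\in{}",
    "≤": r"\leq{}",
    "≠": r"\neq{}",
}

def normalize(text: str) -> str:
    for source, target in SUBSTITUTIONS.items():
        text = text.replace(source, target)
    return text
```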


Finding 2 — HIGH: Normalization runs before the auto-detection heuristic, changing wrapping decisions

File: unstructured/staging/base.py, _emit_formula_markdown

Normalization (_normalize_formula_for_markdown) runs before _formula_auto_use_display_math. The heuristic scores Unicode symbols ([∈∉≤≥≠≈×÷∞∑∫√∂∇]) at +2 each, but after normalization those codepoints are gone and replaced by \leq etc., which trigger the different \\[a-zA-Z]+ check (flat +3). This shifts scores at boundary conditions.

Failure scenario:

Text (85 chars, contains "where"):
  "E ≤ threshold where E is the energy and threshold was determined experimentally"

Raw score: ≤ gives +2, total=2. Prose threshold=3. Score 2<3 → plain text. ✓
After normalization: \leq triggers \\[a-zA-Z]+ → +3, total=3. Score 3≥3 → wrapped in $$. ✗

User sees a prose sentence incorrectly wrapped as display math.

Fix: Run auto-detection on the raw text, then normalize only the text that will be wrapped.
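The reordering can be sketched with a toy version of the heuristic; the weights and threshold mirror the numbers quoted above, but the real scoring function may differ:

```python
import re

UNICODE_MATH = set("∈∉≤≥≠≈×÷∞∑∫√∂∇")

def score(text: str) -> int:
    """Toy heuristic: +2 per Unicode math symbol, +3 for any \\command."""
    s = 2 * sum(1 for ch in text if ch in UNICODE_MATH)
    if re.search(r"\\[a-zA-Z]+", text):
        s += 3
    return s

def emit(text: str, normalize) -> str:
    # Decide wrapping on the RAW text; normalize only what gets wrapped.
    if score(text) >= 3:
        return f"$$\n{normalize(text)}\n$$"
    return text
```

Scoring the raw text means a prose sentence containing a single ≤ stays at score 2 and is emitted as plain text, regardless of what normalization would later turn the symbol into.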


Finding 3 — HIGH: Normalization alters text even when formula_markdown_style="plain"

File: unstructured/staging/base.py, _emit_formula_markdown

When style == "plain", the function returns text — but normalization has already mutated it. A caller using plain style reasonably expects unmodified text.

if normalize_formula:
    text = _normalize_formula_for_markdown(text)  # already mutated
# ...
if style == FORMULA_MARKDOWN_PLAIN:
    return text  # returns normalized, not raw

Failure scenario: element_to_md(Formula("a − b"), formula_markdown_style="plain") returns "a - b" (Unicode minus silently replaced with ASCII hyphen) instead of the original "a − b".

Fix: Check style before normalizing, or skip normalization when style == "plain".
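A minimal sketch of that ordering fix, where the constant name and the substitution are assumptions for illustration:

```python
FORMULA_MARKDOWN_PLAIN = "plain"  # assumed constant name from the review

def emit_formula(text: str, style: str, normalize_formula: bool = True) -> str:
    # Take the plain-style early exit BEFORE any normalization, so plain
    # output is byte-identical to the input text.
    if style == FORMULA_MARKDOWN_PLAIN:
        return text
    if normalize_formula:
        text = text.replace("−", "-")  # illustrative subset of the table
    return f"$$\n{text}\n$$"
```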


Finding 4 — HIGH: Bare except Exception swallows critical errors silently

File: unstructured/partition/pdf_image/pdfminer_utils.py, CustomPDFCIDFont.get_cmap_from_spec

except Exception:
    logger.debug("Failed to parse embedded CMap stream %r", cmap_name)

This catches MemoryError, RecursionError, and any bug in the new parsing code, logging only at debug level without a traceback. In production (log level INFO+), this is completely invisible.

Failure scenario: A CMap contains odd-length hex like <F>. bytes.fromhex("F") raises ValueError. The exception is swallowed, the font falls back to an empty CMap, and all text for that font is silently lost — with zero production log visibility.

Fix: Narrow to specific exceptions (ValueError, KeyError, zlib.error). At minimum add exc_info=True and log at warning.
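A sketch of the narrowed handler; parse and fallback are hypothetical stand-ins for the real code paths:

```python
import logging
import zlib

logger = logging.getLogger(__name__)

def parse_cmap_safely(parse, cmap_name, fallback):
    """Catch only the decode/parse errors the CMap code can plausibly raise,
    and keep the traceback visible at warning level."""
    try:
        return parse()
    except (ValueError, KeyError, zlib.error):
        logger.warning(
            "Failed to parse embedded CMap stream %r", cmap_name, exc_info=True
        )
        return fallback
```

MemoryError, RecursionError, and genuine bugs in the parsing code now propagate instead of silently degrading text extraction.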


Finding 5 — HIGH: Partial CMap returned when mapping budget is exceeded

File: unstructured/partition/pdf_image/pdfminer_utils.py, _parse_embedded_cmap_stream

When a cidrange exceeds the remaining mapping budget, the code uses continue (skip this range, keep processing subsequent ranges). This can return a CMap with arbitrary gaps — some code points decode correctly, others are unmapped.

Failure scenario: A CJK CMap has ranges R1 (5K entries), R2 (130K entries), R3 (1K entries). R1 is accepted, R2 is skipped (exceeds 131K cap), R3 is accepted. The CMap is returned with R1+R3 but missing R2. Since code2cid is truthy, it's trusted. Most CJK characters (in R2) decode to wrong CIDs → garbled text output.

Fix: Return an empty CMap() once the budget is exceeded, rather than a partial map with holes. A partial CMap is worse than no CMap because the caller trusts non-empty code2cid.
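The all-or-nothing policy can be sketched with a plain dict standing in for pdfminer's CMap; the (start, end, start_cid) tuple shape is an assumption:

```python
def build_cid_map(ranges, budget):
    """Build a code->CID map, or return an empty map if any cidrange would
    exceed the budget. A partial map with holes decodes some codepoints to
    wrong CIDs, which is worse than no map at all."""
    code2cid = {}
    for start, end, start_cid in ranges:
        size = end - start + 1
        if len(code2cid) + size > budget:
            return {}  # bail out entirely rather than skip this range
        for offset in range(size):
            code2cid[start + offset] = start_cid + offset
    return code2cid
```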


Finding 6 — HIGH: exactly_one(filename, file) validation dropped

File: unstructured/partition/pdf_image/pdf_image_utils.py, convert_pdf_to_image

The old _render_pdf_pages called exactly_one(filename=filename, file=file) to reject calls where both arguments are provided. The replacement render_pdf_to_image from unstructured-inference only checks if both are None — it does not reject both-truthy.

Failure scenario: A caller passes both filename="old.pdf" and file=<in-memory bytes of new.pdf>. The old code raises immediately. The new code silently uses filename (via filename or file), ignoring the file bytes entirely. The caller gets page images from the wrong PDF with no error.

Fix: Add exactly_one(filename=filename, file=file) inside convert_pdf_to_image before delegating.
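A sketch of the guard; exactly_one_sketch is a toy version of the contract the old code relied on, not the real unstructured.utils helper:

```python
def exactly_one_sketch(**kwargs):
    """Require that exactly one keyword argument is non-None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            f"Exactly one of {sorted(kwargs)} must be given, got {provided}"
        )

def convert_pdf_to_image_sketch(filename=None, file=None):
    # Reject ambiguous calls up front instead of silently preferring filename.
    exactly_one_sketch(filename=filename, file=file)
    return filename if filename is not None else "<from-bytes>"
```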


Finding 7 — HIGH: CustomPDFResourceManager.get_font duplicates parent cache/dispatch logic

File: unstructured/partition/pdf_image/pdfminer_utils.py

The override re-implements the cache lookup, strict-mode check, and subtype extraction from PDFResourceManager.get_font. If pdfminer's internal get_font logic changes (e.g., adding pre-processing, new font subtypes, or changed caching), the override diverges silently.

Risk: A pdfminer version bump adds a pre-processing step (e.g., resolving indirect references) that the override bypasses for CIDFont types. CIDFont construction receives un-resolved spec entries.


Missing High-Value Test Cases

  1. LaTeX command boundary: Test Formula("x∈S") — expected $$\nx \in{} S\n$$ or similar, currently produces broken \inS.
  2. Normalization vs. heuristic ordering: Test a prose-length formula with exactly one Unicode math symbol + a prose keyword, verifying it's NOT wrapped.
  3. CMap budget exhaustion with multiple ranges: Test that a CMap exceeding the mapping cap returns an empty CMap (not a partial one).
  4. Both filename and file provided to convert_pdf_to_image: Should raise ValueError.
  5. Encrypted CMap stream: Test that decipher path in _decode_pdfstream_with_limit handles missing objid/genno.

Positive Notes (not in scope, but worth acknowledging)

  • The standardize_quotes refactor is strictly correct — fixes two silent data-loss bugs (U+201C, U+2018 were dropped by duplicate dict keys) and the ",," entry was dead code in the old implementation. No issues found.
  • The thread-safety concern about removing _pdfium_lock is not a regression: render_pdf_to_image in unstructured-inference has its own lock, and consolidating to one lock is actually an improvement.
  • The embedded CMap parsing logic is well-bounded with size caps and mapping limits (modulo Finding 5 above).
  • The test coverage for the new CMap parsing is thorough.

Verdict: Do Not Merge

Findings 1–3 are correctness bugs in the core Formula→Markdown feature (the primary purpose of this PR). The LaTeX command boundary issue (Finding 1) will produce broken output for common OCR inputs. The normalization-before-heuristic bug (Finding 2) causes incorrect wrapping decisions at boundary conditions. These should be fixed and re-verified before merging. The remaining findings (4–7) in the pdfminer and pdf_image_utils areas add production risk but are individually addressable.

@PastelStorm PastelStorm enabled auto-merge April 3, 2026 20:11
@PastelStorm PastelStorm added this pull request to the merge queue Apr 3, 2026
Merged via the queue into Unstructured-IO:main with commit 264d569 Apr 3, 2026
53 checks passed
