@dmsnell dmsnell commented Jul 15, 2025

Trac ticket: Core-63694
Replaces #6651
See: (#9270), #9850, #9851

Design feedback

  • Core has previously considered HTML like <[[gallery]]> to be an escaped shortcode inside an HTML tag, but HTML considers it plaintext instead of a tag (because the starting character after the initial < is not a letter).
    • To match this behavior we can special-case text nodes which look like tags, but should we? This comes up in shortcode processing, which decides not to replace shortcodes inside tags. So the ultimate question is:
      a. Is this actually a shortcode inside a tag to be ignored?
      b. Is this a shortcode inside a text node?
    • HTML provides the second answer (b). WordPress’ answer is contextual.
      • If it were <[gallery]> and the [gallery] shortcode translated into a tag name then this entire thing would become a tag on replacement.
      • If it translated into a non-tag-name, however, the replacement would remain plaintext.
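
The distinction above can be sketched in a few lines of PHP. This is only an illustration and not code from this PR: `looks_like_tag_opener()` is a hypothetical helper that applies nothing but the first-character rule (a `<`, optionally followed by `/`, must be followed by an ASCII letter), and the shortcode outputs `div` and `123` are made-up values.

```php
<?php
/**
 * Hypothetical helper: reports whether the `<` at offset $at could open a tag.
 * Only the first-character rule is checked; comments, doctypes, and other
 * `<!` or `<?` syntax are deliberately ignored in this sketch.
 */
function looks_like_tag_opener( string $html, int $at ): bool {
	if ( '<' !== ( $html[ $at ] ?? '' ) ) {
		return false;
	}

	// Skip the solidus of a closing tag such as `</p>`.
	$name_at = '/' === ( $html[ $at + 1 ] ?? '' ) ? $at + 2 : $at + 1;
	$c       = $html[ $name_at ] ?? '';

	// A tag name must start with an ASCII letter; anything else is text.
	return ( $c >= 'a' && $c <= 'z' ) || ( $c >= 'A' && $c <= 'Z' );
}

// `<[gallery]>` starts with `<[`, so HTML treats it as plaintext.
var_dump( looks_like_tag_opener( '<[gallery]>', 0 ) ); // bool(false)

// Made-up shortcode outputs: one is a valid tag name, one is not.
var_dump( looks_like_tag_opener( str_replace( '[gallery]', 'div', '<[gallery]>' ), 0 ) ); // bool(true)
var_dump( looks_like_tag_opener( str_replace( '[gallery]', '123', '<[gallery]>' ), 0 ) ); // bool(false)
```

The same input therefore flips between "text" and "tag" depending on what the shortcode expands to, which is why WordPress' answer is contextual.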

Implementation

This probably improves the performance in terms of both CPU time and memory compared to the old PCRE-based approach.

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 5 times, most recently from 1410116 to 0440d83 Compare July 15, 2025 22:19

sirreal commented Jul 29, 2025

I believe this would fix https://core.trac.wordpress.org/ticket/45387.

@github-actions

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60665
GitHub commit: 2abe245

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Aug 26, 2025
@dmsnell dmsnell reopened this Aug 26, 2025
@github-actions

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60726
GitHub commit: 00b3b63

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Sep 10, 2025

dmsnell commented Sep 10, 2025

@github-actions why don’t I come in and mess with all of your work unsolicited, huh?

$regex = get_html_split_regex();
$result = benchmark_pcre_backtracking( $regex, $input, 'split' );
return $this->assertLessThan( 200, $result );
}

There is no longer a PCRE used in wp_html_split() and therefore no backtracking.
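
For illustration only, here is one way to split a string into text and tag-like chunks with plain string functions instead of a regular expression. It is a simplified sketch and not the implementation in this PR: `split_without_pcre()` is a hypothetical name, it returns only non-empty chunks, and it ignores comments, attribute values containing `>`, and other cases the real `wp_html_split()` must handle. Because it only scans forward with `strpos()`, there is no PCRE backtracking to benchmark.

```php
<?php
/**
 * Hypothetical sketch: split input into text chunks and tag-like chunks
 * using forward scans only. Not the code from this PR.
 */
function split_without_pcre( string $input ): array {
	$chunks     = array();
	$end        = strlen( $input );
	$at         = 0; // Where the next search for `<` starts.
	$text_start = 0; // Where the current run of text began.

	while ( $at < $end ) {
		$lt = strpos( $input, '<', $at );
		if ( false === $lt ) {
			break;
		}

		// A tag needs an ASCII letter after `<` or `</`; otherwise it is text.
		$name_at = '/' === ( $input[ $lt + 1 ] ?? '' ) ? $lt + 2 : $lt + 1;
		$c       = $input[ $name_at ] ?? '';
		if ( ! ( ( $c >= 'a' && $c <= 'z' ) || ( $c >= 'A' && $c <= 'Z' ) ) ) {
			$at = $lt + 1;
			continue;
		}

		$gt = strpos( $input, '>', $name_at );
		if ( false === $gt ) {
			break; // Unclosed tag: treat the remainder as text.
		}

		if ( $lt > $text_start ) {
			$chunks[] = substr( $input, $text_start, $lt - $text_start );
		}
		$chunks[]   = substr( $input, $lt, $gt - $lt + 1 );
		$at         = $gt + 1;
		$text_start = $at;
	}

	if ( $text_start < $end ) {
		$chunks[] = substr( $input, $text_start );
	}

	return $chunks;
}

print_r( split_without_pcre( 'a <[x]> <b>c</b>' ) );
// Chunks: 'a <[x]> ', '<b>', 'c', '</b>'
```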

@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 5 times, most recently from d319642 to c5b62b8 Compare September 18, 2025 19:59
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 3 times, most recently from a51cc59 to 8a5805f Compare September 23, 2025 20:21
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 3 times, most recently from e17de28 to 479b18a Compare September 30, 2025 20:57
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 5 times, most recently from a52a8f9 to fcac561 Compare October 8, 2025 04:42
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 3 times, most recently from c06e2c8 to e4f3798 Compare October 9, 2025 23:39
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch 2 times, most recently from ff20347 to 085390b Compare October 21, 2025 08:33
Previously, a non-escaped `<` was detected as the start of an “element”,
so a newline in the following text was replaced with `<!-- wpnl -->` on
the assumption that it sat inside a tag. That placeholder was ultimately
translated back into a raw `\n`.
@dmsnell dmsnell force-pushed the html-api/refactor-html-split-regex branch from fcb6b14 to f8a1e05 Compare October 21, 2025 09:23