@dmsnell
Member

@dmsnell dmsnell commented Jun 28, 2025

Replaces #6760

Trac ticket: Core-61401
See: (#9105), #9841, #9842 -> #10291, #9843 -> #10292

Todo

  • Update parse_blocks() docs and explain when to use which interface

Description

The Block Scanner follows the HTML API in providing a streaming,
near-zero-overhead, lazy, re-entrant parser for traversing block
structure. This class provides an alternate interface to
parse_blocks() which is more amenable to a number of common
server-side operations on posts, such as:

  • Generating an excerpt from only the first N blocks in a post.
  • Determining which block types are present in a post.
  • Determining which posts contain a block of a given type.
  • Generating block supports content for a post.
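
For readers new to the idea, here is a minimal sketch of why a lazy scan suits the operations listed above. It is not the class in this PR: `sketch_block_types()` and `sketch_has_block()` are hypothetical names, and the delimiter pattern is a simplification that ignores attributes and edge cases. It only illustrates that a streaming search can stop at the first match instead of building the full tree that parse_blocks() returns.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: lazily yields the block type
 * named by each opening or void block delimiter in a post's HTML.
 */
function sketch_block_types( string $html ): Generator {
	// Simplified pattern: optional "namespace/" followed by the block name.
	$pattern = '~<!--\s+wp:((?:[a-z][a-z0-9_-]*/)?[a-z][a-z0-9_-]*)\s~';
	$at      = 0;

	while ( 1 === preg_match( $pattern, $html, $match, PREG_OFFSET_CAPTURE, $at ) ) {
		$name = $match[1][0];
		// Block types written without a namespace belong to "core".
		yield false !== strpos( $name, '/' ) ? $name : "core/{$name}";
		// Resume right after this match so no input is searched twice.
		$at = $match[0][1] + strlen( $match[0][0] );
	}
}

/**
 * Hypothetical helper: reports whether a block of the given type appears.
 */
function sketch_has_block( string $html, string $block_type ): bool {
	foreach ( sketch_block_types( $html ) as $found ) {
		if ( $found === $block_type ) {
			return true; // Stop early; the rest of the post is never searched.
		}
	}
	return false;
}
```

Because the generator only advances when asked, `sketch_has_block()` does work proportional to the position of the first match, which is the property the description is after.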

Planned refactors

  • traverse_and_serialize_blocks() / block hooks
  • finding post galleries

@github-actions

github-actions bot commented Jun 28, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, soean, tjnowell, westonruter, jonsurrell, gziolo.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell marked this pull request as draft October 6, 2025 21:05
@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch from 82c693a to 5f99617 Compare October 6, 2025 21:17
@dmsnell
Member Author

dmsnell commented Oct 6, 2025

@gziolo something I was thinking about on my walk this morning was an idea I’ve ignored for a while, but maybe is worth it here: by default we skip freeform HTML content. the way we could signal to visit it is not with a boolean attribute, but using a special block type in next_delimiter().

// visit all explicit block delimiters
$scanner->next_delimiter();

// visit all block delimiters plus freeform blocks.
$scanner->next_delimiter( '*' );

// visit all freeform blocks: a possible special token meaning “no explicit block”
$scanner->next_delimiter( '#freeform' );

// visit all HTML spans, including `innerHTML`
$scanner->next_token();

this would collapse the difference between next_token() and next_delimiter() to whether they visit inner HTML. for any performance ideas about visiting freeform blocks, we could indicate that from within next_delimiter() and keep it internal.

I will try and explore this angle.
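
To make the matching rules above concrete, here is a toy model. Everything in it is an assumption for illustration: `Sketch_Scanner`, `sketch_collect()`, and the `#inner-html` label are hypothetical, the delimiter pattern is simplified, and the class tokenizes eagerly, which the real parser does not. It only shows how one argument to next_delimiter() could select between explicit delimiters, delimiters plus top-level freeform content, and freeform content alone.

```php
<?php
/**
 * Toy model of the proposed matching rules. Hypothetical class; it
 * tokenizes eagerly, unlike the streaming parser discussed in this PR.
 */
final class Sketch_Scanner {
	private $tokens  = array(); // Block names, "#freeform", or "#inner-html".
	private $next    = 0;
	private $current = null;

	public function __construct( string $html ) {
		// Simplified delimiter pattern; it ignores ">" inside JSON attributes.
		$pattern = '~<!--\s+(/)?wp:([a-z][a-z0-9_/-]*)\s[^>]*?(/)?-->~';
		$depth   = 0;
		$at      = 0;

		preg_match_all( $pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE );
		foreach ( $matches as $m ) {
			$this->add_html( $m[0][1] - $at, $depth );
			$at = $m[0][1] + strlen( $m[0][0] );

			$is_closer      = '/' === $m[1][0];
			$is_void        = isset( $m[3] ) && '/' === $m[3][0];
			$this->tokens[] = $m[2][0];
			$depth         += $is_closer ? -1 : ( $is_void ? 0 : 1 );
		}
		$this->add_html( strlen( $html ) - $at, $depth );
	}

	// HTML outside every block is freeform; HTML inside a block is inner HTML.
	private function add_html( int $length, int $depth ) {
		if ( $length > 0 ) {
			$this->tokens[] = 0 === $depth ? '#freeform' : '#inner-html';
		}
	}

	// Visits every token, including inner HTML.
	public function next_token(): bool {
		$this->current = $this->tokens[ $this->next ] ?? null;
		$this->next++;
		return null !== $this->current;
	}

	// null: explicit delimiters only. "*": delimiters plus freeform.
	// "#freeform": freeform only. Anything else: that block name only.
	public function next_delimiter( $type = null ): bool {
		while ( $this->next_token() ) {
			if ( '#inner-html' === $this->current ) {
				continue;
			}
			if ( '#freeform' === $this->current ) {
				if ( '*' === $type || '#freeform' === $type ) {
					return true;
				}
				continue;
			}
			if ( null === $type || '*' === $type || $type === $this->current ) {
				return true;
			}
		}
		return false;
	}

	public function get_type() {
		return $this->current;
	}
}

// Hypothetical helper: lists what next_delimiter() visits for a given argument.
function sketch_collect( string $html, $type = null ): array {
	$scanner = new Sketch_Scanner( $html );
	$seen    = array();
	while ( $scanner->next_delimiter( $type ) ) {
		$seen[] = $scanner->get_type();
	}
	return $seen;
}
```

In this model the only difference between next_token() and next_delimiter( '*' ) is whether inner HTML is visited, which is the collapse the comment describes.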

@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch 3 times, most recently from fa33ff1 to 53e2769 Compare October 7, 2025 04:24
@gziolo
Member

gziolo commented Oct 7, 2025

by default we skip freeform HTML content. the way we could signal to visit it is not with a boolean attribute, but using a special block type in next_delimiter().

That's interesting. It creates a nice logical separation between the higher-level next_delimiter() and the low-level next_token().

// visit all freeform blocks: a possible special token meaning “no explicit block”
$scanner->next_delimiter( '#freeform' );

We have this concept of freeform content handler in JavaScript which you can even modify with wp.blocks.setFreeformContentHandlerName and it defaults to core/freeform. I don't think we ever did anything like that on the server, so either #freeform or core/freeform resonates with me. Maybe folks could even set this alias, because I noticed that in the widgets editor, for example, the freeform handler gets set to core/html. It is probably safer to go more meta with #freeform or ::freeform (like CSS pseudo-elements).

@dmsnell
Member Author

dmsnell commented Oct 7, 2025

We have this concept of freeform content handler in JavaScript which you can even modify with wp.blocks.setFreeformContentHandlerName and it defaults to core/freeform. I don't think we ever did anything like that on the server

That’s something I wish had been more hard-coded and less extensible, based on my work on the two-stage block parsing and loading inside the block editor, both in WordPress and in Tumblr. But I guess JS is different because we have to load something into the editor, whereas on the server a non-block can remain not a block.

Since yesterday I’ve been using * and deferring an approach which visits only the HTML spans. Something that makes me nervous about using #freeform is that we have a relatively short timeline if we want to get this in for 6.9 and properly choosing a name for those seems rushed. It would be an easy thing to add later, I would hope. Something I didn’t mention is why I proposed that syntax:

  • block names cannot contain the octothorpe character (#) anywhere
  • this mirrors the #text and other token type descriptions in the HTML API
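
A quick way to check the first point is to test candidate tokens against the block-name grammar. The pattern below follows the name portion of the regular expression in the existing WP_Block_Parser as I understand it; treat it as an assumption and verify it against the source. `sketch_is_block_name()` is a hypothetical helper.

```php
<?php
// Assumed block-name grammar: an optional "namespace/" and then a name, each
// starting with a lowercase letter. Verify against WP_Block_Parser.
const SKETCH_BLOCK_NAME = '~\A(?:[a-z][a-z0-9_-]*/)?[a-z][a-z0-9_-]*\z~';

// Hypothetical helper: true when the candidate could be a real block name.
function sketch_is_block_name( string $candidate ): bool {
	return 1 === preg_match( SKETCH_BLOCK_NAME, $candidate );
}
```

Under that grammar neither `#freeform` nor `*` can collide with a registered block type, so both are safe to reserve as special arguments.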

When I left off last night I was working on my extract_block() method, whose tests were failing. I think there is still some ambiguity in the code around the distinction between innerHTML and freeform blocks. What I plan on doing today is to try and do something I’ve been avoiding because of the increased internal state requirements: track HTML spans and pause when we match them. Currently this has all worked by recognizing HTML spans only after matching the next delimiter. With a special state value we can then visit the HTML and move on without re-scanning. There are some complications doing this with incomplete inputs as well, so it’s probably a good time to make this change.
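
The state change described above can be sketched roughly as follows. This is a hypothetical illustration, not the code in this PR: `Sketch_Lazy_Scanner`, its properties, and the simplified delimiter pattern are all assumptions, and incomplete inputs are not handled. The point is the `$pending` value: when a search finds a delimiter beyond the current position, the scanner pauses on the HTML span first and remembers the delimiter, so the next call visits it without searching again.

```php
<?php
/**
 * Hypothetical sketch: pause on an HTML span, remember the delimiter that
 * ended it, and visit that delimiter next without searching again.
 */
final class Sketch_Lazy_Scanner {
	const DELIMITER = '~<!--\s+/?wp:[a-z][a-z0-9_/-]*\s[^>]*?/?-->~';

	public $kind  = null; // "html" or "delimiter" for the current token.
	public $text  = null; // Raw text of the current token.
	public $scans = 0;    // How many times the input was searched.

	private $html;
	private $at      = 0;
	private $pending = null; // Delimiter already matched but not yet visited.

	public function __construct( string $html ) {
		$this->html = $html;
	}

	public function next_token(): bool {
		if ( null !== $this->pending ) {
			// Special state: this delimiter was found while matching the span.
			$text          = $this->pending;
			$this->pending = null;
			return $this->visit( 'delimiter', $text );
		}

		if ( $this->at >= strlen( $this->html ) ) {
			return false;
		}

		$this->scans++;
		if ( 1 !== preg_match( self::DELIMITER, $this->html, $m, PREG_OFFSET_CAPTURE, $this->at ) ) {
			// No more delimiters: the rest of the input is one HTML span.
			return $this->visit( 'html', substr( $this->html, $this->at ) );
		}

		list( $text, $start ) = $m[0];
		if ( $start > $this->at ) {
			// Pause on the HTML span and keep the delimiter for the next call.
			$this->pending = $text;
			return $this->visit( 'html', substr( $this->html, $this->at, $start - $this->at ) );
		}

		return $this->visit( 'delimiter', $text );
	}

	// Records the current token and advances past it.
	private function visit( string $kind, string $text ): bool {
		$this->kind = $kind;
		$this->text = $text;
		$this->at  += strlen( $text );
		return true;
	}
}
```

The `$scans` counter exists only so the sketch can be checked: an input with two HTML spans and two delimiters should need two searches, not four.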

Hopefully that clears up all of the remaining issues.

@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch 4 times, most recently from 98e19bc to 853e3a6 Compare October 9, 2025 20:53
@dmsnell dmsnell marked this pull request as ready for review October 9, 2025 20:53
@dmsnell
Member Author

dmsnell commented Oct 9, 2025

This has been updated significantly to clarify the interface and to merge the block scanner and block processor into one.

It would be really helpful to have a review of the public methods, their docblocks, and the usage in the test suite.

I plan on going back to update the class/module docblock with more substantial general guidance for the class.

Also I plan on doing a separate type annotation pass, so please disregard any comments currently on type annotations.

@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch from 853e3a6 to 4b027f3 Compare October 9, 2025 23:39
Member

@westonruter westonruter left a comment

Just a couple nits

@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch from 4b027f3 to 3fee490 Compare October 10, 2025 02:49
@swissspidy swissspidy changed the title from "Blocks: Introduce WP_Block_Scanner for efficiently parsing blocks." to "Blocks: Introduce WP_Block_Processor for efficiently parsing blocks." Oct 10, 2025
@dmsnell
Member Author

dmsnell commented Oct 15, 2025

With significant haste in the eleventh hour I have finished making some major changes to this proposal:

  • There is now only a single WP_Block_Processor class (thanks @swissspidy for renaming the PR).
  • This singular class differentiates between freeform content and inner HTML, making it possible for calling code to inquire which is which.
    • A new block type wildcard * can be passed to the next_ functions to ensure that they visit top-level freeform content.
  • The docs have all been updated, names have been changed, and type annotations have been added.
  • The processor treats HTML spans like void blocks; previously it created a separate opener and closer.
  • Almost all references to static:: have been replaced by self::. After consideration, as a default first step I think it would be wise to treat most of this class as final and discourage subclasses from reimplementing methods like next_token() the way subclasses do in the HTML API. Since this class does not need a higher-level subclass, that is easier to do here.
    • The exceptions are a few methods intended to be replaced by subclasses, including the missing lazy JSON parser for static::get_attributes().
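
For anyone unfamiliar with the distinction, here is a small standalone PHP example of what switching from static:: to self:: changes. The class and method names are hypothetical and unrelated to the processor's real methods; only the language behavior (late static binding) is the point.

```php
<?php
class Sketch_Processor {
	// self:: always resolves to the class where this method is written.
	public function via_self(): string {
		return self::describe();
	}

	// static:: resolves to the class the object was created from.
	public function via_static(): string {
		return static::describe();
	}

	protected static function describe(): string {
		return 'base';
	}
}

class Sketch_Subclass extends Sketch_Processor {
	// This override is only reachable through static:: calls in the parent.
	protected static function describe(): string {
		return 'subclass';
	}
}
```

Calling `via_self()` on a `Sketch_Subclass` instance still runs the parent's `describe()`, while `via_static()` picks up the override. Using self:: by default therefore keeps subclasses from changing internal behavior, and the few static:: calls mark the intended extension points.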

Personally I’m quite happy with this revision and I thank you all for your input on this. I think it’s in a much more coherent and usable state now. At this point I believe it’s ready for final review and merge, though I think it would be best only to block merging now for design issues with the class, the methods, the things which must be maintained ad infinitum. For other issues, I would prefer if we can address them during the beta phase after merging.

Of course, I invite all feedback and will address things if I can in a timely manner, but right now I would love to see this out there in trunk to get it in use and testing.

…ordPress#9105)

The Block Processor follows the HTML API in providing a streaming,
near-zero-overhead, lazy, re-entrant parser for traversing block
structure. This class provides an alternate interface to
`parse_blocks()` which is more amenable to a number of common
server-side operations on posts, such as:

 - Generating an excerpt from only the first N blocks in a post.
 - Determining which block types are present in a post.
 - Determining which posts contain a block of a given type.
 - Generating block supports content for a post.
 - Modifying a single block, or only blocks of a given kind in a post.

Co-authored-by: Sören Wünsch <[email protected]>
Co-authored-by: Tom J Nowell <[email protected]>
Co-authored-by: Weston Ruter <[email protected]>
Co-authored-by: Jon Surrell <[email protected]>
Co-authored-by: Greg Ziółkowski <[email protected]>
Github-PR: 9105
Github-PR-URL: WordPress#9105
Trac-Ticket: 61401
Trac-Ticket-URL: https://core.trac.wordpress.org/ticket/61401
@dmsnell dmsnell force-pushed the blocks/add-block-scanner branch from 4625a42 to 549ee7f Compare October 15, 2025 21:01
pento pushed a commit that referenced this pull request Oct 15, 2025
The Block Processor follows the HTML API in providing a streaming, near-zero-overhead, lazy, re-entrant parser for traversing block structure. This class provides an alternate interface to `parse_blocks()` which is more amenable to a number of common server-side operations on posts, especially those needing to operate on only a part of a full post.

Developed in #9105
Discussed in https://core.trac.wordpress.org/ticket/61401

Props dmsnell, gziolo, jonsurrell, soean, tjnowell, westonruter.
Fixes #61401.


git-svn-id: https://develop.svn.wordpress.org/trunk@60939 602fd350-edb4-49c9-b593-d223f7449a82
@github-actions

A commit was made that fixes the Trac ticket referenced in the description of this pull request.

SVN changeset: 60939
GitHub commit: cf56ccb

This PR will be closed, but please confirm the accuracy of this and reopen if there is more work to be done.

@github-actions github-actions bot closed this Oct 15, 2025
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request Oct 15, 2025
Built from https://develop.svn.wordpress.org/trunk@60939


git-svn-id: http://core.svn.wordpress.org/trunk@60275 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell dmsnell deleted the blocks/add-block-scanner branch October 15, 2025 21:54
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request Oct 15, 2025
Built from https://develop.svn.wordpress.org/trunk@60939


git-svn-id: https://core.svn.wordpress.org/trunk@60275 1a063a9b-81f0-0310-95a4-ce76da25c4cd