[Improve][connector-starrocks] Improved starrocks source enumerator splits allocation algorithm for subtasks by JeremyXin · Pull Request #10867 · apache/seatunnel

JeremyXin · 2026-05-11T08:45:34Z

Purpose of this pull request

Similar to PR #9108, this PR improves the algorithm the StarRocks source enumerator uses to allocate splits to subtasks, and adds a unit test.

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

If your PR adds any new Jar binary package, please add a License Notice according to the New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

DanielLeens

Thanks for the contribution. I pulled the latest head locally and reviewed the real StarRocks source enumerator lifecycle instead of only the helper diff.

What this PR fixes

User pain: the current owner calculation can leave StarRocks splits uneven across readers.
Fix approach: sort split ids first, then assign them round-robin.
In one sentence: the goal makes sense, but the current implementation breaks both the recovery contract and the real multi-table runtime path.

Runtime path I checked

normal startup
  -> run() [StartRocksSourceSplitEnumerator.java:81-94]
      -> poll one table
      -> getStarRocksSourceSplit(table) [193-201]
      -> addPendingSplit(newSplits) [151-164]
      -> assignSplit(readers) [167-190]

reader recovery
  -> addSplitsBack(splits, subtaskId) [102-106]
      -> addPendingSplit(splits)
      -> assignSplit(subtaskId)

Problem 1: the returned-split recovery path is no longer correct

Location: StartRocksSourceSplitEnumerator.java:102-106, 151-164
Why this is a problem: before this change, addSplitsBack() could recompute the same owner from splitId.hashCode() and then immediately assign back to the recovering subtaskId. After the change, owner selection depends on the local round-robin order inside the current addPendingSplit() call. That means a split returned by reader 2 can now be put into reader 0's bucket, while assignSplit(Collections.singletonList(subtaskId)) still only tries to send work back to reader 2.
Risk: the split can remain stranded in pendingSplit, so the recovery path can stall even though the source still has unassigned work.
Suggested fix:
- Option A: keep addSplitsBack() pinned to the original subtaskId instead of reusing the new owner calculation.
- Option B: if you want re-computed ownership, it still needs to be derived from a stable split identity, not from the per-call round-robin position.
Severity: high
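
The two options above can be sketched as follows. This is a minimal illustration, not the connector's code: the class `RecoveryOwnerSketch`, the `String` split ids, and the helper `pendingFor` are hypothetical stand-ins, and the hash formula is only an assumption about what a stable owner calculation could look like.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoveryOwnerSketch {
    // Hypothetical stand-in for pendingSplit: reader id -> split ids waiting for that reader.
    private final Map<Integer, List<String>> pendingSplit = new HashMap<>();

    // Option B: the owner depends only on the split id and the reader count,
    // so repeated calls for the same split always agree.
    public static int stableOwner(String splitId, int numReaders) {
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (splitId.hashCode() & Integer.MAX_VALUE) % numReaders;
    }

    // Option A: returned splits go straight into the bucket of the reader that
    // returned them, bypassing any owner recalculation.
    public void addSplitsBack(List<String> splits, int subtaskId) {
        pendingSplit.computeIfAbsent(subtaskId, k -> new ArrayList<>()).addAll(splits);
    }

    // Read-only view used to check where the returned splits ended up.
    public List<String> pendingFor(int subtaskId) {
        return pendingSplit.getOrDefault(subtaskId, Collections.emptyList());
    }
}
```

With Option A, a follow-up assignment that targets only the recovering subtaskId always finds the returned splits in that reader's bucket, so nothing is left behind for a reader that is not being served.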

Problem 2: the new balancing claim does not hold on the real multi-table path

Location: StartRocksSourceSplitEnumerator.java:81-94, 151-164
Why this is a problem: run() processes one table at a time, but assignCount is reset to 0 on every addPendingSplit() call. So if a table produces a single split, that split always starts again from reader 0. With many small tables, the normal path still concentrates work on low-number readers.
Risk: the PR can replace one skew mode with another, especially for common small-table workloads.
Suggested fix:
- Option A: keep assignCount monotonic across the whole enumeration cycle.
- Option B: aggregate all generated splits first, then do one global round-robin pass.
Severity: high
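
The skew described above is easy to reproduce in isolation. The sketch below is a simplified model, not the enumerator itself: `RoundRobinCounterSketch` and both method names are hypothetical, and each table is reduced to a split count. It compares a counter that restarts for every table batch with one that keeps counting across batches (Option A).

```java
public class RoundRobinCounterSketch {
    // Models the reviewed behavior: the counter restarts at 0 for every table batch.
    public static int[] assignWithReset(int tables, int splitsPerTable, int numReaders) {
        int[] load = new int[numReaders];
        for (int t = 0; t < tables; t++) {
            int assignCount = 0; // reset on every batch
            for (int s = 0; s < splitsPerTable; s++) {
                load[Math.floorMod(assignCount++, numReaders)]++;
            }
        }
        return load;
    }

    // Models Option A: one counter shared by all batches in the enumeration cycle.
    public static int[] assignMonotonic(int tables, int splitsPerTable, int numReaders) {
        int[] load = new int[numReaders];
        int assignCount = 0; // never reset between tables
        for (int t = 0; t < tables; t++) {
            for (int s = 0; s < splitsPerTable; s++) {
                // floorMod keeps the index non-negative even if the counter overflows.
                load[Math.floorMod(assignCount++, numReaders)]++;
            }
        }
        return load;
    }
}
```

For six single-split tables and three readers, `assignWithReset(6, 1, 3)` yields loads of 6, 0, 0, while `assignMonotonic(6, 1, 3)` yields 2, 2, 2.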

Tests

The new test only exercises one synthetic single-batch call to the private addPendingSplit() helper (StarRocksSourceSplitEnumeratorTest.java:40-60). It does not cover the actual run() batching behavior or the addSplitsBack() recovery path, which is exactly where the regressions are.

Conclusion: merge after fixes

Blocking items

Problem 1: recovery can strand returned splits in the wrong reader bucket.
Problem 2: the real multi-table path still does not produce the balanced behavior the PR is aiming for.

Suggested follow-up

Please add lifecycle-level coverage for both run() and addSplitsBack() once the logic is adjusted.
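
As a rough idea of what lifecycle-level coverage would need to assert, here is a small in-memory model of the two paths. Everything in it is hypothetical: `MiniEnumeratorLifecycle` is not the SeaTunnel enumerator, split ids are plain strings, and the model already assumes a monotonic counter and pinned recovery. A real test would drive the actual enumerator with a mocked context instead.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniEnumeratorLifecycle {
    private final int numReaders;
    private final Map<Integer, List<String>> pending = new HashMap<>();
    private final Map<Integer, List<String>> delivered = new HashMap<>();
    private int assignCount = 0; // kept across tables, never reset per batch

    public MiniEnumeratorLifecycle(int numReaders) {
        this.numReaders = numReaders;
    }

    // Normal startup: one table at a time, sorted split ids, round-robin owners.
    public void run(List<List<String>> tables) {
        List<Integer> allReaders = new ArrayList<>();
        for (int r = 0; r < numReaders; r++) {
            allReaders.add(r);
        }
        for (List<String> tableSplits : tables) {
            List<String> sorted = new ArrayList<>(tableSplits);
            Collections.sort(sorted);
            for (String split : sorted) {
                int owner = Math.floorMod(assignCount++, numReaders);
                pending.computeIfAbsent(owner, k -> new ArrayList<>()).add(split);
            }
            assignSplit(allReaders);
        }
    }

    // Reader recovery: returned splits stay pinned to the reader that returned them.
    public void addSplitsBack(List<String> splits, int subtaskId) {
        pending.computeIfAbsent(subtaskId, k -> new ArrayList<>()).addAll(splits);
        assignSplit(Collections.singletonList(subtaskId));
    }

    // Moves everything pending for the given readers into the delivered map.
    private void assignSplit(Collection<Integer> readers) {
        for (Integer reader : readers) {
            List<String> ready = pending.remove(reader);
            if (ready != null) {
                delivered.computeIfAbsent(reader, k -> new ArrayList<>()).addAll(ready);
            }
        }
    }

    // Number of splits still waiting in any reader bucket.
    public int pendingCount() {
        int total = 0;
        for (List<String> splits : pending.values()) {
            total += splits.size();
        }
        return total;
    }

    // Number of splits handed to one reader so far, including re-deliveries.
    public int deliveredTo(int reader) {
        return delivered.getOrDefault(reader, Collections.emptyList()).size();
    }
}
```

The two properties worth checking are that single-split tables spread across readers after `run()`, and that `pendingCount()` is zero after `addSplitsBack()` for a single reader.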

[Improve][Connector-V2] Balance StarRocks source split allocation

e130837

github-actions Bot added starrocks connectors-v2 labels May 11, 2026

DanielLeens suggested changes May 11, 2026

fix: fix error

cec18cc

JeremyXin wants to merge 2 commits into apache:dev from JeremyXin:improve-starrocks-split-balance