Feat/fuse remap local ids triton by Xu-Sheng-lin · Pull Request #993 · alibaba/rtp-llm

Xu-Sheng-lin · 2026-05-12T03:28:37Z

Replace 6 separate elementwise HIP kernel launches (2x compare, 1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a single fused Triton kernel that performs the same remap logic in one pass. Reduces per-layer decode overhead by ~39us (6 launches → 1 launch).
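The six-step remap and its single-pass equivalent can be sketched in plain Python. This is only an illustration: the PR text names the op counts (2x compare, 1x OR, 2x masked_fill, 1x add) but not their exact semantics, so the range check `[lo, hi)`, the `INVALID_ID` sentinel, and zeroing the weights of non-local experts are all assumptions, and the real kernel is written in Triton rather than as a Python loop.

```python
from typing import List, Tuple

INVALID_ID = -1  # assumed sentinel for experts not owned by this rank


def remap_unfused(ids: List[int], weights: List[float], lo: int, hi: int
                  ) -> Tuple[List[int], List[float]]:
    """Six separate elementwise passes, mirroring six kernel launches."""
    below = [i < lo for i in ids]                       # compare 1
    above = [i >= hi for i in ids]                      # compare 2
    invalid = [b or a for b, a in zip(below, above)]    # OR
    shifted = [i + (-lo) for i in ids]                  # add
    local_ids = [INVALID_ID if m else s
                 for s, m in zip(shifted, invalid)]     # masked_fill 1
    local_w = [0.0 if m else w
               for w, m in zip(weights, invalid)]       # masked_fill 2
    return local_ids, local_w


def remap_fused(ids: List[int], weights: List[float], lo: int, hi: int
                ) -> Tuple[List[int], List[float]]:
    """One pass over the data: each element is read once and written once."""
    local_ids, local_w = [], []
    for i, w in zip(ids, weights):
        owned = lo <= i < hi
        local_ids.append(i - lo if owned else INVALID_ID)
        local_w.append(w if owned else 0.0)
    return local_ids, local_w
```

Both functions return the same result for the same inputs. Under these assumptions, the saving in the real kernel comes from launching once and avoiding the intermediate tensors, not from doing less arithmetic.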

Add MoRI-based expert parallelism (EP) as an alternative to DeepEP for ROCm GPUs, enabling efficient intranode and internode dispatch/combine for MoE models.

Core components:
- MoriEPWrapper: singleton wrapper around mori EpDispatchCombineOp
- MoriEpIntranodeRouter: EP router with dispatch/combine flow, chunked dispatch for large token counts, and global-to-local ID remapping for fused expert kernels
- MoriEpFp4Strategy: strategy pairing MoRI router with FP4 executor
- RocmEpNormalStrategy: MoRI router as first-class option alongside DeepEP, with mutual exclusion validation

Config flow:
- --use_mori_ep CLI flag (env: USE_MORI_EP) → deep_ep_config → auto_configure_deepep() → moe_config.use_mori_ep
- C++ MoeConfig.use_mori_ep field with pybind binding
- MoEConfigAdapter exposes use_mori_ep and use_deepep_moe

Bazel: moriep_wrapper py_library target with modules dependency.
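The config flow and the mutual exclusion check described above could look roughly like the sketch below. Only `--use_mori_ep`, `USE_MORI_EP`, `use_mori_ep`, and `use_deepep_moe` come from the PR text. `MoeConfigSketch`, `resolve_moe_config`, the `--use_deepep_moe` flag, and the `USE_DEEPEP_MOE` variable are hypothetical names for illustration and are not the actual rtp-llm code.

```python
import argparse
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class MoeConfigSketch:
    """Hypothetical stand-in for the resolved MoE config."""
    use_mori_ep: bool = False
    use_deepep_moe: bool = False


def _env_flag(name: str, env: Mapping[str, str]) -> bool:
    # Treat common truthy spellings as enabled; anything else is disabled.
    return env.get(name, "0").strip().lower() in ("1", "true", "yes", "on")


def resolve_moe_config(argv: List[str], env: Mapping[str, str]) -> MoeConfigSketch:
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--use_mori_ep", action="store_true")
    parser.add_argument("--use_deepep_moe", action="store_true")  # assumed flag name
    args, _unknown = parser.parse_known_args(argv)

    cfg = MoeConfigSketch(
        use_mori_ep=args.use_mori_ep or _env_flag("USE_MORI_EP", env),
        use_deepep_moe=args.use_deepep_moe or _env_flag("USE_DEEPEP_MOE", env),
    )
    # Mutual exclusion: only one EP backend may be selected at a time.
    if cfg.use_mori_ep and cfg.use_deepep_moe:
        raise ValueError("use_mori_ep and use_deepep_moe are mutually exclusive")
    return cfg
```

Passing `env` explicitly (for example `os.environ`) instead of reading it inside the function keeps the resolution step easy to unit test.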

- FakeBalanceExpert ROCm C++ op for expert load balancing testing
- MoriEpIntranodeRouter unit tests
- BlockPoolConfigHelper adjustment for MoE cache allocation

CLAassistant · 2026-05-12T03:28:45Z

All committers have signed the CLA.

Replace 6 separate elementwise HIP kernel launches (2x compare, 1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a single fused Triton kernel that performs the same remap logic in one pass. Reduces per-layer decode overhead by ~39us (6 launches → 1 launch).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

LLLLKKKK · 2026-05-12T03:58:58Z

AI Code Review - PR #993

Status: BLOCKING

Summary: P0/0 · P1/2 · P2/2 · P3/0

Blocking Issues

P1

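Adding skip_allreduce breaks finalize calls on routers that were not updated @ rtp_llm/models_py/modules/factory/fused_moe/defs/fused_moe.py:244
- Suggestion: update every FusedMoeDataRouter.finalize implementation and the abstract interface together, or avoid passing the keyword to routers that do not support it.
MoriEP uses the WORLD group with inconsistent rank and world_size @ rtp_llm/models_py/distributed/moriep_wrapper.py:76
- Suggestion: when using the WORLD group, pass the unique world_rank; when using the EP group, set world_size to ep_size and register the corresponding EP process group.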
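For the first P1 item, the failure mode and the second suggested workaround can be reproduced with a few lines of Python. `RouterBase`, `UpdatedRouter`, `LegacyRouter`, and `call_finalize` are hypothetical stand-ins for illustration, not the classes in fused_moe.py.

```python
import inspect
from abc import ABC, abstractmethod


class RouterBase(ABC):
    """Hypothetical stand-in for the router base class."""

    @abstractmethod
    def finalize(self, hidden, skip_allreduce: bool = False):
        ...


class UpdatedRouter(RouterBase):
    def finalize(self, hidden, skip_allreduce: bool = False):
        return ("updated", skip_allreduce)


class LegacyRouter(RouterBase):
    # Not updated: passing skip_allreduce=... raises TypeError.
    def finalize(self, hidden):
        return ("legacy", False)


def accepts_keyword(func, name: str) -> bool:
    """True if func has a parameter called name or accepts **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_finalize(router: RouterBase, hidden, skip_allreduce: bool):
    # Only forward the keyword to routers whose signature supports it.
    if accepts_keyword(router.finalize, "skip_allreduce"):
        return router.finalize(hidden, skip_allreduce=skip_allreduce)
    return router.finalize(hidden)
```

The guarded call avoids the TypeError but silently drops the flag for legacy routers, which is why updating every implementation together with the abstract interface is the cleaner of the two suggestions.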

Non-blocking Suggestions

P2

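MoriEP router logs at info level on the per-forward path @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:86
- Suggestion: switch to debug level, or log only on the first call or by sampling, and add the necessary metrics.
MoriEP router tests are not wired in and the current call would fail @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/BUILD:67
- Suggestion: fix the _create call or its signature, and wire the MoriEP router tests into Bazel/CI; skip when mori is unavailable instead of commenting out the target.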
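The "skip instead of commenting out" suggestion maps onto the standard unittest skip decorators. The sketch below assumes the package is importable as `mori`, and the test class and its body are placeholders rather than the PR's actual tests.

```python
import importlib.util
import io
import unittest

# Assumption: the MoRI Python package is importable under the name "mori".
HAS_MORI = importlib.util.find_spec("mori") is not None


@unittest.skipUnless(HAS_MORI, "mori is not installed")
class MoriEpRouterSmokeTest(unittest.TestCase):
    def test_import(self):
        import mori  # only reached when the package is present

        self.assertIsNotNone(mori)


def run_suite() -> unittest.TestResult:
    """Run the smoke test and return the result without printing to the console."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MoriEpRouterSmokeTest)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

With this shape the target can stay enabled in BUILD: on machines without mori the case is reported as skipped rather than failing or disappearing from CI.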

Checklist Violations (13 fail / 72 total)

General Principles Checklist

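[6.1] Software Engineering - OCP: prefer local extension points over modifying central logic → issue Adding skip_allreduce breaks finalize calls on routers that were not updated
The central finalize call gained a new parameter, but not all router implementations were updated to match.
[6.1] Software Engineering - LSP: subclasses/overrides keep the base-class contract → issue Adding skip_allreduce breaks finalize calls on routers that were not updated
Subclass signatures of router.finalize are inconsistent, which breaks the base-class calling contract.
[6.1] Architecture - State invariants: create/update/failure/retry/rollback paths remain valid → issue MoriEP uses the WORLD group with inconsistent rank and world_size
MoriEP initialization mixes ep_rank with the WORLD world_size, so the distributed rank invariant does not hold.
[6.1] Architecture - Observability: logs/metrics/timeouts are actionable and not noisy → issue MoriEP router logs at info level on the per-forward path
MoriEP prepare/finalize use logging.info on the hot path.
[6.1] Architecture - Compatibility: public API/persisted data/config/environment migrations are safe → issue Adding skip_allreduce breaks finalize calls on routers that were not updated
The change to the router.finalize calling convention does not cover all implementations.
[6.1] Tests - New logic has focused unit tests plus relevant integration/smoke tests → issue MoriEP router tests are not wired in and the current call would fail
The MoriEP router test target is commented out and does not run in Bazel/CI.
[6.1] Tests - Edge cases are covered (empty, single element, maximum) → issue MoriEP router tests are not wired in and the current call would fail
The new MoriEP router tests are not wired in, so their edge-case coverage cannot take effect in CI.
[6.1] Tests - Distributed/cross-platform changes have matching coverage → issue MoriEP router tests are not wired in and the current call would fail
The ROCm distributed MoriEP change has no enabled py_test coverage.
[6.1] Quality - No per-forward debug logs or noisy hot-path output → issue MoriEP router logs at info level on the per-forward path
The MoriEP router uses logging.info on the prepare/finalize hot path.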
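For the observability and quality items on hot-path logging, one way to follow the "debug level or first-call logging" suggestion is sketched below. The logger name, the `LogOnce` helper, and the `prepare` function are hypothetical and only illustrate the pattern.

```python
import logging

logger = logging.getLogger("mori_ep_router_sketch")  # hypothetical logger name


class LogOnce:
    """Emit an info record the first time a key is seen, then stay silent."""

    def __init__(self, log: logging.Logger) -> None:
        self._log = log
        self._seen = set()

    def info(self, key: str, msg: str, *args) -> bool:
        if key in self._seen:
            return False
        self._seen.add(key)
        self._log.info(msg, *args)
        return True


_once = LogOnce(logger)


def prepare(num_tokens: int) -> int:
    # First call only: enough to confirm in the logs which router is active.
    _once.info("prepare", "MoriEP prepare first call: num_tokens=%d", num_tokens)
    # Per-forward detail stays at debug level and is guarded so that no
    # formatting work happens when debug logging is disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("prepare num_tokens=%d", num_tokens)
    return num_tokens
```

At INFO level, repeated calls to `prepare` produce a single record. Anything that must be tracked per forward pass is better exported as a metric than as a log line.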

RTP-LLM Checklist

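[E] Distributed - Cross-rank data consistency → issue MoriEP uses the WORLD group with inconsistent rank and world_size
The MoriEP config mixes ep_rank with the WORLD world_size, so the cross-rank mapping is inconsistent when ep_size < world_size.
[H] Tests and CI - Sufficient test coverage: equivalent coverage for large refactors, end-to-end tests for new features → issue MoriEP router tests are not wired in and the current call would fail
The new MoriEP router tests are commented out and do not run in Bazel/CI.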
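The rank invariant behind item [E] can be made concrete with a small check. `EpGroupSpec`, `pick_group_spec`, and `ranks_are_unique` are hypothetical helpers, and deriving `ep_rank` as `world_rank % ep_size` in the usage note is an assumed layout chosen only to show the collision.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EpGroupSpec:
    """(rank, world_size) pair handed to the communication backend."""
    rank: int
    world_size: int


def pick_group_spec(world_rank: int, world_size: int, ep_rank: int,
                    ep_size: int, use_world_group: bool) -> EpGroupSpec:
    # Keep rank and size from the same group: never pair ep_rank with world_size.
    if use_world_group:
        spec = EpGroupSpec(rank=world_rank, world_size=world_size)
    else:
        spec = EpGroupSpec(rank=ep_rank, world_size=ep_size)
    if not 0 <= spec.rank < spec.world_size:
        raise ValueError(f"rank {spec.rank} outside [0, {spec.world_size})")
    return spec


def ranks_are_unique(specs: Iterable[EpGroupSpec]) -> bool:
    """True if the specs of one group cover ranks 0..n-1 exactly once."""
    ranks = sorted(s.rank for s in specs)
    return ranks == list(range(len(ranks)))
```

With world_size = 4 and ep_size = 2, pairing `ep_rank` with the WORLD size yields ranks 0, 1, 0, 1. Two processes then claim the same slot in a four-member group, which is the inconsistency the review flags when ep_size < world_size.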

Python Static-First Checklist

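[P.G] Test conventions - Data-driven tests use pytest.mark.parametrize → checklist-only
The new tests use unittest/subTest; the main blocker right now is that the target is not wired in and the call signature is wrong, so this is not escalated to a separate issue for now.
[P.H] Type annotations - Any must carry a comment explaining why → checklist-only
The MoriEP wrapper uses Any in several places to pass through the third-party mori op; adding a Protocol later is recommended, but more direct runtime problems take priority for now.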
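The Protocol suggestion in [P.H] could take roughly the following shape. The method names `dispatch` and `combine` and their list-based signatures are assumptions made for this sketch; they are not the mori op's real interface, which the PR text does not show.

```python
from typing import List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class DispatchCombineOp(Protocol):
    """Structural type for the third-party op, replacing a bare Any."""

    def dispatch(self, tokens: List[int], topk_ids: List[int]
                 ) -> Tuple[List[int], List[int]]:
        ...

    def combine(self, expert_out: List[int], topk_ids: List[int]) -> List[int]:
        ...


class FakeOp:
    """Test double that satisfies the protocol without inheriting from it."""

    def dispatch(self, tokens, topk_ids):
        return list(tokens), list(topk_ids)

    def combine(self, expert_out, topk_ids):
        return list(expert_out)


def round_trip(op: DispatchCombineOp, tokens: List[int],
               topk_ids: List[int]) -> List[int]:
    # Static checkers verify that op has both methods; no Any is needed.
    received, ids = op.dispatch(tokens, topk_ids)
    return op.combine(received, ids)
```

Because the protocol is structural, the wrapper can be unit tested with a fake like `FakeOp` on machines where mori is not installed.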

Strengths

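The new use_mori_ep setting is threaded consistently through the Python config, the C++ MoeConfig, and the pybind default value.
The change to the use_all_gather rule comes with a regression test for the ep_size == tp_size > 1 case.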

jacobwin-ai added 2 commits May 9, 2026 09:18

feat: add fake balance expert and MoriEP tests

9cee69f

- FakeBalanceExpert ROCm C++ op for expert load balancing testing
- MoriEpIntranodeRouter unit tests
- BlockPoolConfigHelper adjustment for MoE cache allocation

Xu-Sheng-lin requested a review from LLLLKKKK as a code owner May 12, 2026 03:28

Xu-Sheng-lin force-pushed the feat/fuse-remap-local-ids-triton branch from 1b54054 to ce2606a Compare May 12, 2026 03:32

Xu-Sheng-lin wants to merge 3 commits into main from feat/fuse-remap-local-ids-triton
