Feat/fuse remap local ids triton #993

Open
Xu-Sheng-lin wants to merge 3 commits into main from feat/fuse-remap-local-ids-triton
@Xu-Sheng-lin
Collaborator

Replace six separate elementwise HIP kernel launches (2x compare, 1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a single fused Triton kernel that performs the same remap logic in one pass. This reduces per-layer decode overhead by ~39 µs (6 launches → 1 launch).
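
As a rough illustration of what fusing the remap means, the sketch below compares a six-pass version with a one-pass version in plain Python. It is not the PR's Triton kernel. The split of the six ops (in particular that the second masked_fill zeroes the routing weights), the sentinel `INVALID_ID = -1`, and the names `remap_unfused` / `remap_fused` are assumptions made for illustration only.

```python
from typing import List, Tuple

INVALID_ID = -1  # assumed sentinel for experts not owned by this rank


def remap_unfused(ids: List[int], weights: List[float],
                  start: int, num_local: int) -> Tuple[List[int], List[float]]:
    """Six separate elementwise passes (each would be one kernel launch)."""
    end = start + num_local
    below = [i < start for i in ids]                      # compare 1
    above = [i >= end for i in ids]                       # compare 2
    outside = [b or a for b, a in zip(below, above)]      # OR
    shifted = [i + (-start) for i in ids]                 # add
    local_ids = [INVALID_ID if m else s
                 for m, s in zip(outside, shifted)]       # masked_fill 1
    local_w = [0.0 if m else w
               for m, w in zip(outside, weights)]         # masked_fill 2 (assumed)
    return local_ids, local_w


def remap_fused(ids: List[int], weights: List[float],
                start: int, num_local: int) -> Tuple[List[int], List[float]]:
    """Same result computed in a single pass over the elements."""
    end = start + num_local
    local_ids: List[int] = []
    local_w: List[float] = []
    for i, w in zip(ids, weights):
        if start <= i < end:
            local_ids.append(i - start)
            local_w.append(w)
        else:
            local_ids.append(INVALID_ID)
            local_w.append(0.0)
    return local_ids, local_w
```

In a fused GPU kernel each thread would do the body of that loop for its own elements, so the intermediate masks never round-trip through global memory and only one launch is paid per layer.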

Add MoRI-based expert parallelism (EP) as an alternative to DeepEP for
ROCm GPUs, enabling efficient intranode and internode dispatch/combine
for MoE models.

Core components:
- MoriEPWrapper: singleton wrapper around mori EpDispatchCombineOp
- MoriEpIntranodeRouter: EP router with dispatch/combine flow,
  chunked dispatch for large token counts, and global-to-local ID
  remapping for fused expert kernels
- MoriEpFp4Strategy: strategy pairing MoRI router with FP4 executor
- RocmEpNormalStrategy: MoRI router as first-class option alongside
  DeepEP, with mutual exclusion validation

Config flow:
- --use_mori_ep CLI flag (env: USE_MORI_EP) → deep_ep_config →
  auto_configure_deepep() → moe_config.use_mori_ep
- C++ MoeConfig.use_mori_ep field with pybind binding
- MoEConfigAdapter exposes use_mori_ep and use_deepep_moe
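
The flag-plus-environment pattern and the mutual exclusion with DeepEP could look roughly like the sketch below. Only `--use_mori_ep` and `USE_MORI_EP` come from the description above; the `--use_deepep_moe` flag, the `USE_DEEPEP_MOE` variable, the boolean `store_true` form, and the function names are hypothetical and may not match the real argument parser.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def _env_flag(env: Mapping[str, str], name: str) -> bool:
    """Treat 1/true/yes (case-insensitive) as enabled."""
    return env.get(name, "0").strip().lower() in ("1", "true", "yes")


def parse_ep_flags(argv: Sequence[str],
                   env: Optional[Mapping[str, str]] = None) -> argparse.Namespace:
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # The environment variable supplies the default; the CLI flag can enable it.
    parser.add_argument("--use_mori_ep", action="store_true",
                        default=_env_flag(env, "USE_MORI_EP"))
    # Hypothetical DeepEP counterpart, used only to show the exclusion check.
    parser.add_argument("--use_deepep_moe", action="store_true",
                        default=_env_flag(env, "USE_DEEPEP_MOE"))
    args = parser.parse_args(list(argv))
    if args.use_mori_ep and args.use_deepep_moe:
        raise ValueError("use_mori_ep and use_deepep_moe are mutually exclusive")
    return args
```

Validating at parse time keeps the conflict from surfacing later, inside strategy selection.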

Bazel: moriep_wrapper py_library target with modules dependency.
- FakeBalanceExpert ROCm C++ op for expert load balancing testing
- MoriEpIntranodeRouter unit tests
- BlockPoolConfigHelper adjustment for MoE cache allocation
@Xu-Sheng-lin Xu-Sheng-lin requested a review from LLLLKKKK as a code owner May 12, 2026 03:28
@CLAassistant

CLAassistant commented May 12, 2026

CLA assistant check
All committers have signed the CLA.

Replace 6 separate elementwise HIP kernel launches (2x compare,
1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a
single fused Triton kernel that performs the same remap logic in
one pass. Reduces per-layer decode overhead by ~39us (6 launches
→ 1 launch).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Xu-Sheng-lin force-pushed the feat/fuse-remap-local-ids-triton branch from 1b54054 to ce2606a on May 12, 2026 03:32
@LLLLKKKK
Collaborator

AI Code Review - PR #993

Status: BLOCKING

Summary: P0/0 · P1/2 · P2/2 · P3/0

Blocking Issues

P1

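  • The new skip_allreduce argument breaks finalize calls on routers that were not updated @ rtp_llm/models_py/modules/factory/fused_moe/defs/fused_moe.py:244
    • Suggestion: update every FusedMoeDataRouter.finalize implementation and the abstract interface together, or avoid passing the keyword to routers that do not support it.
  • MoriEP uses the WORLD group, but rank and world_size are inconsistent @ rtp_llm/models_py/distributed/moriep_wrapper.py:76
    • Suggestion: when using the WORLD group, pass the unique world_rank; when using the EP group, world_size should be ep_size and the corresponding EP process group should be registered.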
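
The first P1 item is a signature-compatibility problem. The sketch below reproduces the failure mode with stand-in classes and shows one way a caller could avoid passing an unsupported keyword. `RouterBase`, `UpdatedRouter`, `LegacyRouter`, `call_finalize`, and the simplified `finalize(hidden, ...)` signature are all hypothetical; the real FusedMoeDataRouter interface takes different arguments.

```python
import inspect
from abc import ABC, abstractmethod
from typing import Callable, List


class RouterBase(ABC):
    @abstractmethod
    def finalize(self, hidden: List[float], skip_allreduce: bool = False) -> List[float]:
        ...


class UpdatedRouter(RouterBase):
    def finalize(self, hidden: List[float], skip_allreduce: bool = False) -> List[float]:
        return hidden


class LegacyRouter(RouterBase):
    # Not updated: calling finalize(..., skip_allreduce=...) raises TypeError.
    def finalize(self, hidden: List[float]) -> List[float]:  # type: ignore[override]
        return hidden


def accepts_kwarg(fn: Callable, name: str) -> bool:
    """True if fn declares `name` or takes **kwargs."""
    params = inspect.signature(fn).parameters
    return name in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_finalize(router: RouterBase, hidden: List[float],
                  skip_allreduce: bool) -> List[float]:
    if accepts_kwarg(router.finalize, "skip_allreduce"):
        return router.finalize(hidden, skip_allreduce=skip_allreduce)
    if skip_allreduce:
        # Fail loudly rather than silently running the all-reduce anyway.
        raise NotImplementedError(
            f"{type(router).__name__}.finalize does not support skip_allreduce")
    return router.finalize(hidden)
```

Updating every implementation along with the abstract method is the cleaner fix; the signature probe is only a stopgap for routers that cannot be changed at the same time.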

Non-blocking Suggestions

P2

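  • The MoriEP router emits info-level logs on the per-forward path @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/routers/mori_ep_intranode_router.py:86
    • Suggestion: lower them to debug level, or use first-occurrence/sampled logging and add the necessary metrics.
  • The MoriEP router tests are not wired in, and the current call would fail @ rtp_llm/models_py/modules/factory/fused_moe/impl/rocm/test/BUILD:67
    • Suggestion: fix the _create call or its signature, and wire the MoriEP router tests into Bazel/CI; skip when mori is unavailable instead of commenting out the target.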
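
Skipping when mori is unavailable, rather than commenting out the target, could be done along these lines with the standard `unittest` skip decorator. The top-level import name `mori` and the test class are assumptions; the real test would exercise MoriEpIntranodeRouter instead of a bare import.

```python
import importlib.util
import unittest

# Assumes the package is importable as `mori`; find_spec returns None if absent.
HAS_MORI = importlib.util.find_spec("mori") is not None


@unittest.skipUnless(HAS_MORI, "mori is not installed")
class MoriEpRouterSmokeTest(unittest.TestCase):
    def test_import(self) -> None:
        import mori  # noqa: F401  (only runs when the package exists)
        self.assertIsNotNone(mori)


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MoriEpRouterSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this shape the target stays in the build graph and is reported as skipped on machines without mori, so a signature break such as the `_create` mismatch shows up as soon as the dependency is present.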

Checklist Violations (13 fail / 72 total)

General Principles Checklist

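  • [6.1] Software Engineering - OCP: prefer local extension points over modifying central logic → issue The new skip_allreduce argument breaks finalize calls on routers that were not updated
    The central finalize call gained a parameter, but not every router implementation was updated to match.
  • [6.1] Software Engineering - LSP: subclasses/overrides preserve the base-class contract → issue The new skip_allreduce argument breaks finalize calls on routers that were not updated
    router.finalize subclass signatures are inconsistent, which breaks the base-class call contract.
  • [6.1] Architecture - State invariants: create/update/failure/retry/rollback paths are valid → issue MoriEP uses the WORLD group, but rank and world_size are inconsistent
    MoriEP initialization mixes ep_rank with the WORLD world_size, so the distributed rank invariant does not hold.
  • [6.1] Architecture - Observability: logs/metrics/timeouts are actionable and not noisy → issue The MoriEP router emits info-level logs on the per-forward path
    MoriEP prepare/finalize use logging.info on the hot path.
  • [6.1] Architecture - Compatibility: public API/persistent data/config/environment migrations are safe → issue The new skip_allreduce argument breaks finalize calls on routers that were not updated
    The change to the router.finalize calling convention does not cover all implementations.
  • [6.1] Tests - New logic has focused unit tests plus relevant integration/smoke tests → issue The MoriEP router tests are not wired in, and the current call would fail
    The MoriEP router test target is commented out and does not run in Bazel/CI.
  • [6.1] Tests - Edge cases are covered (empty, single element, maximum) → issue The MoriEP router tests are not wired in, and the current call would fail
    The new MoriEP router tests are not wired in, so their edge-case coverage never takes effect in CI.
  • [6.1] Tests - Distributed/cross-platform changes have corresponding coverage → issue The MoriEP router tests are not wired in, and the current call would fail
    The ROCm distributed MoriEP changes have no enabled py_test coverage.
  • [6.1] Quality - No per-forward debug logging / noisy hot-path output → issue The MoriEP router emits info-level logs on the per-forward path
    The MoriEP router uses logging.info on the prepare/finalize hot path.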
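
One way to follow the first-occurrence logging suggestion is a small helper that logs at info level once per key and at debug level afterwards. This is a minimal sketch using only the standard `logging` module; `OnceLogger`, `info_once`, and the logger name are hypothetical and not part of the PR.

```python
import logging
from typing import Any, Set

logger = logging.getLogger("mori_ep_router_sketch")  # hypothetical logger name


class OnceLogger:
    """Logs a message at INFO the first time a key is seen, DEBUG afterwards."""

    def __init__(self, log: logging.Logger) -> None:
        self._log = log
        self._seen: Set[str] = set()

    def info_once(self, key: str, msg: str, *args: Any) -> bool:
        if key in self._seen:
            # Repeat calls stay available for debugging without flooding INFO.
            self._log.debug(msg, *args)
            return False
        self._seen.add(key)
        self._log.info(msg, *args)
        return True


once = OnceLogger(logger)


def prepare(num_tokens: int) -> None:
    # Called once per forward pass; only the first call reaches INFO.
    once.info_once("prepare", "MoriEP prepare: num_tokens=%d", num_tokens)
```

Passing format arguments instead of pre-formatted strings also avoids building the message when the level is disabled.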

RTP-LLM Checklist

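  • [E] Distributed - Cross-rank data consistency → issue MoriEP uses the WORLD group, but rank and world_size are inconsistent
    The MoriEP configuration mixes ep_rank with the WORLD world_size; when ep_size < world_size, the cross-rank mapping is inconsistent.
  • [H] Tests and CI - Sufficient test coverage: equivalence coverage for large refactors, end-to-end tests for new features → issue The MoriEP router tests are not wired in, and the current call would fail
    The new MoriEP router tests are commented out and do not run in Bazel/CI.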
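
The rank/world_size inconsistency can be seen with a little arithmetic. The sketch below assumes `ep_rank = world_rank % ep_size`, which is a common layout but not confirmed by this PR, and uses hypothetical names (`GroupSpec`, `make_group_spec`, `mixed_ranks`). It shows that pairing ep_rank with the WORLD size produces duplicate ranks, and that choosing rank and size from the same group avoids this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroupSpec:
    rank: int
    world_size: int
    group_name: str


def mixed_ranks(world_size: int, ep_size: int) -> List[int]:
    """Ranks seen when ep_rank is (incorrectly) used with the WORLD group."""
    return [world_rank % ep_size for world_rank in range(world_size)]  # assumed layout


def make_group_spec(world_rank: int, world_size: int,
                    ep_size: int, use_world_group: bool) -> GroupSpec:
    """Pick rank and size from the same group so they stay consistent."""
    if use_world_group:
        spec = GroupSpec(world_rank, world_size, "WORLD")
    else:
        spec = GroupSpec(world_rank % ep_size, ep_size, "EP")
    if not 0 <= spec.rank < spec.world_size:
        raise ValueError(f"rank {spec.rank} out of range for size {spec.world_size}")
    return spec
```

For example, with `world_size=8` and `ep_size=4`, `mixed_ranks(8, 4)` yields only four distinct values for eight processes, so two processes would claim the same rank in an eight-way group.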

Python Static-First Checklist

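  • [P.G] Test conventions - Data-driven tests use pytest.mark.parametrize → checklist-only
    The new tests use unittest/subTest; the main blockers right now are the unwired target and the incorrect call signature, so this is not escalated to a separate issue for now.
  • [P.H] Type annotations - Any must carry a comment explaining why → checklist-only
    The MoriEP wrapper uses Any in several places to pass through the third-party mori op; adding a Protocol later is recommended, but more direct runtime problems exist at the moment.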
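
The Protocol suggestion could take roughly the following shape. The method names `dispatch` and `combine` are inferred from the "dispatch/combine flow" wording in the PR description and are assumptions about the mori op's surface; `DispatchCombineOp`, `FakeOp`, and `OpWrapper` are hypothetical names.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DispatchCombineOp(Protocol):
    """Structural type for the third-party op the wrapper passes through."""

    # Any: argument and return types belong to the third-party library.
    def dispatch(self, *args: Any, **kwargs: Any) -> Any: ...

    # Any: same reason as dispatch.
    def combine(self, *args: Any, **kwargs: Any) -> Any: ...


class FakeOp:
    """Test double that satisfies the protocol without importing mori."""

    def dispatch(self, *args: Any, **kwargs: Any) -> Any:
        return args

    def combine(self, *args: Any, **kwargs: Any) -> Any:
        return args


class OpWrapper:
    def __init__(self, op: DispatchCombineOp) -> None:
        # runtime_checkable only verifies that the methods exist, not their types.
        if not isinstance(op, DispatchCombineOp):
            raise TypeError("op must provide dispatch() and combine()")
        self._op = op

    def dispatch(self, *args: Any, **kwargs: Any) -> Any:
        return self._op.dispatch(*args, **kwargs)
```

Typing the attribute as a Protocol narrows the one place where `Any` is unavoidable and makes it easy to substitute a fake op in unit tests.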

Strengths

  • The new use_mori_ep option is plumbed through the Python config, the C++ MoeConfig, and the pybind default value.
  • The use_all_gather rule change comes with a regression test for ep_size == tp_size > 1.
