Feat/fuse remap local ids triton #993
Open
Xu-Sheng-lin wants to merge 3 commits into
Add MoRI-based expert parallelism (EP) as an alternative to DeepEP for ROCm GPUs, enabling efficient intranode and internode dispatch/combine for MoE models.

Core components:
- MoriEPWrapper: singleton wrapper around mori EpDispatchCombineOp
- MoriEpIntranodeRouter: EP router with dispatch/combine flow, chunked dispatch for large token counts, and global-to-local ID remapping for fused expert kernels
- MoriEpFp4Strategy: strategy pairing MoRI router with FP4 executor
- RocmEpNormalStrategy: MoRI router as first-class option alongside DeepEP, with mutual exclusion validation

Config flow:
- --use_mori_ep CLI flag (env: USE_MORI_EP) → deep_ep_config → auto_configure_deepep() → moe_config.use_mori_ep
- C++ MoeConfig.use_mori_ep field with pybind binding
- MoEConfigAdapter exposes use_mori_ep and use_deepep_moe

Bazel: moriep_wrapper py_library target with modules dependency.

- FakeBalanceExpert ROCm C++ op for expert load balancing testing
- MoriEpIntranodeRouter unit tests
- BlockPoolConfigHelper adjustment for MoE cache allocation
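The flag handling and mutual exclusion validation described in this commit message can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: `EpFlags`, `parse_ep_flags`, the `--use_deepep_moe` flag spelling, and the `USE_DEEPEP_MOE` environment variable are hypothetical. Only `--use_mori_ep`, `USE_MORI_EP`, `use_mori_ep`, and `use_deepep_moe` appear in the description.

```python
import argparse
import os
from dataclasses import dataclass


@dataclass
class EpFlags:
    # Hypothetical container; the PR stores these fields on moe_config.
    use_mori_ep: bool
    use_deepep_moe: bool


def _env_flag(name, env):
    # "1", "true", "yes", "on" enable the flag; anything else (or unset) disables it.
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def parse_ep_flags(argv, env=None):
    """Resolve CLI flags with an environment fallback, then validate them."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_mori_ep", action="store_true",
                        default=_env_flag("USE_MORI_EP", env))
    # Assumed flag name and env var for the DeepEP side (not stated in the PR).
    parser.add_argument("--use_deepep_moe", action="store_true",
                        default=_env_flag("USE_DEEPEP_MOE", env))
    args = parser.parse_args(argv)
    # Mutual exclusion: only one EP backend may be selected at a time.
    if args.use_mori_ep and args.use_deepep_moe:
        raise ValueError("use_mori_ep and use_deepep_moe are mutually exclusive")
    return EpFlags(args.use_mori_ep, args.use_deepep_moe)
```

Failing at configuration time, instead of inside the router, keeps a conflicting backend selection from surfacing only after model load.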
Replace 6 separate elementwise HIP kernel launches (2x compare, 1x OR, 2x masked_fill, 1x add) in MoriEpIntranodeRouter with a single fused Triton kernel that performs the same remap logic in one pass. Reduces per-layer decode overhead by ~39 us (6 launches → 1 launch).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
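A plain-Python reference can make the remap logic concrete. It is a sketch of the semantics only, not the Triton kernel from this PR. The assumptions are: experts owned by a rank form the contiguous range `[start, start + num_local)`, out-of-range IDs are replaced by a hypothetical sentinel `-1`, and the second `masked_fill` zeroes the matching routing weights. The PR text does not state what the two `masked_fill` calls operate on, so that mapping is a guess.

```python
SENTINEL = -1  # hypothetical marker for "expert is not on this rank"


def remap_unfused(ids, weights, start, num_local):
    """Six separate passes, mirroring 2x compare, 1x OR, 1x add, 2x masked_fill."""
    end = start + num_local
    below = [i < start for i in ids]                    # compare 1
    above = [i >= end for i in ids]                     # compare 2
    invalid = [b or a for b, a in zip(below, above)]    # OR
    shifted = [i - start for i in ids]                  # add (-start)
    local_ids = [SENTINEL if m else s
                 for s, m in zip(shifted, invalid)]     # masked_fill 1 (assumed target: ids)
    local_w = [0.0 if m else w
               for w, m in zip(weights, invalid)]       # masked_fill 2 (assumed target: weights)
    return local_ids, local_w


def remap_fused(ids, weights, start, num_local):
    """One pass: each element is read once and written once."""
    end = start + num_local
    local_ids, local_w = [], []
    for i, w in zip(ids, weights):
        if start <= i < end:
            local_ids.append(i - start)
            local_w.append(w)
        else:
            local_ids.append(SENTINEL)
            local_w.append(0.0)
    return local_ids, local_w
```

For example, with `start=4` and `num_local=8`, the global IDs `[0, 5, 8, 11, 12]` map to `[-1, 1, 4, 7, -1]` in both versions. The fused form produces no intermediate buffers, which is what lets a GPU implementation use a single kernel launch.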
1b54054 to ce2606a
Collaborator
AI Code Review - PR #993
Status: BLOCKING
Summary: P0/0 · P1/2 · P2/2 · P3/0
Blocking Issues
P1
Non-blocking Suggestions
P2
Checklist Violations (13 fail / 72 total)
General Principles Checklist
RTP-LLM Checklist
Python Static-First Checklist
Strengths