Optimize fused_moe_kernel with Split-K #9486
Conversation
Summary of Changes
Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an optimization for the fused_moe_kernel by implementing Split-K v1 support. The goal is to improve the performance of this critical operator in Mixture of Experts (MoE) models, particularly in the context of Tensor Parallelism (TP MoE). While the current implementation, which uses tl.atomic_add, achieves performance close to the baseline, it lays the groundwork for further optimizations and addresses a key area for performance improvement.
Highlights
- Fused MoE Kernel Optimization: Implemented Split-K v1 support for the fused_moe_kernel to enhance performance in Mixture of Experts operations.
- New Kernel Introduction: A new Triton kernel, fused_moe_kernel_splitk, has been added to handle the Split-K logic, enabling parallel computation across the K dimension (a generic sketch of this pattern appears right after this list).
- Performance Characteristics: Benchmarking indicates that the Split-K implementation, particularly with SPLIT_K=4, achieves performance comparable to the existing baseline, with potential for further gains.
- Identified Limitations: The current approach uses tl.atomic_add, which introduces a performance overhead, and there are identified issues with potential 32-bit integer overflow for strides in large workspaces, suggesting areas for future refactoring.
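For readers unfamiliar with the pattern, below is a minimal, hypothetical Triton sketch of Split-K with atomic accumulation. It is a plain matmul, not the PR's actual fused_moe_kernel_splitk (which additionally handles expert routing, sorted token IDs, and scaling); the kernel name, signature, and tile sizes are illustrative assumptions only.

```python
import triton
import triton.language as tl

@triton.jit
def splitk_matmul_sketch(a_ptr, b_ptr, c_ptr, M, N, K,
                         stride_am, stride_ak, stride_bk, stride_bn,
                         stride_cm, stride_cn,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    pid_k = tl.program_id(2)  # extra grid axis introduced by Split-K

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Each program walks only its own 1/SPLIT_K share of the K dimension.
    for _ in range(0, tl.cdiv(K, BLOCK_K * SPLIT_K)):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
        b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        offs_k += BLOCK_K * SPLIT_K

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    # The SPLIT_K partial results are merged atomically into a zero-initialized C;
    # this atomic traffic is the overhead the PR description calls out.
    tl.atomic_add(c_ptrs, acc, mask=c_mask)
```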
Code Review
This pull request introduces a new Split-K Triton kernel for the fused Mixture of Experts (MoE) operation, aiming to optimize performance. The changes include the new fused_moe_kernel_splitk kernel, logic to invoke it based on configuration, and updates to default configurations to support this new feature. My review focuses on improving code quality and maintainability. I've identified several unused parameters in the new kernel, a redundant variable assignment, and a hardcoded configuration value that could be made more flexible. Addressing these points will enhance the clarity and long-term health of the codebase.
It seems that increasing parallelism with Split-K doesn't provide significant benefits for fused MoE. The analysis above is great, but from a practical and code-redundancy perspective, would it be better not to introduce this kernel implementation into fused_moe_triton?
Motivation
In TP MoE, fused_moe_kernel is one of the most critical and time-consuming operators.

Per profiling, this kernel achieves very high compute throughput, which explains why it outperforms the Triton v3.4.0 fused_moe kernel according to the benchmark results in #9276.


The only defect of the baseline is a small amount of warp-stall sampling in LDS, which seems unavoidable; Split-K encounters the same issue.

This PR's main idea is to introduce Split-K. The new kernel does not exceed the baseline, but comes close to it.
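At the launch site, Split-K shows up as a third grid axis and a zero-initialized float32 output, so that the partial tiles can be accumulated with tl.atomic_add. A hedged usage sketch for the hypothetical kernel shown earlier (shapes and tile sizes are made up):

```python
import torch
import triton

M, N, K = 4096, 4096, 4096
BLOCK_M, BLOCK_N, BLOCK_K, SPLIT_K = 64, 64, 32, 4

a = torch.randn((M, K), device="cuda", dtype=torch.float32)
b = torch.randn((K, N), device="cuda", dtype=torch.float32)
# Zero-initialized float32 destination: each of the SPLIT_K programs adds its
# partial result into C instead of overwriting it.
c = torch.zeros((M, N), device="cuda", dtype=torch.float32)

# The third grid dimension enumerates the K slices.
grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N), SPLIT_K)
splitk_matmul_sketch[grid](
    a, b, c, M, N, K,
    a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
    BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K, SPLIT_K=SPLIT_K,
)
```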

Per the benchmark results, SPLIT_K=4 performs best on H20-96GB.
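For context, the tuned fused-MoE configs are per-shape dictionaries of tile sizes and launch parameters, and a Split-K factor would be selected the same way. The entry below is purely illustrative: the key names follow the existing BLOCK_SIZE_*/num_warps convention, but the values and the exact SPLIT_K key are assumptions, not this PR's tuned config.

```python
# Illustrative only: what a tuned config entry with a Split-K factor might look like.
example_moe_config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 3,
    "SPLIT_K": 4,  # value reported best on H20-96GB in the PR's benchmarks
}
```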
This PR currently sacrifices some performance by using tl.atomic_add when the Split-K programs store their partial results to C in parallel. That means there is still room for improvement in the parts below, but realizing it requires a refactor that would push the scope out of control, so I'm opening this PR for discussion:
The workspace layout is calculated with BLOCK_M * BLOCK_N, num_n * BLOCK_M * BLOCK_N, …, which are plain Python integers that Triton down-casts to 32-bit constants. When the product of pid_* and stride_ws_* exceeds 2^32 − 1, the offset overflows, the pointer ends up referencing unallocated memory, and the GPU immediately throws an illegal memory access error. So all the strides need to be changed to int64; otherwise there is an out-of-bounds access.
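A hedged sketch of the remedy described above: promote the 32-bit program id (or the stride) to int64 before the multiplication, so the workspace offset is computed in 64-bit arithmetic. The kernel below is a stand-alone illustration under that assumption, not code from this PR:

```python
import triton
import triton.language as tl

@triton.jit
def int64_offset_sketch(ws_ptr, out_ptr, stride_ws_pid, BLOCK: tl.constexpr):
    # Program ids are int32; with a 32-bit stride the product is also 32-bit and
    # can wrap around on large workspaces. Casting to int64 first keeps the
    # address arithmetic in 64 bits.
    pid = tl.program_id(0).to(tl.int64)
    offs = pid * stride_ws_pid + tl.arange(0, BLOCK)
    vals = tl.load(ws_ptr + offs)
    tl.store(out_ptr + offs, vals)
```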
In general, even after refactoring it to the workspace‑based Split‑K version, it can only achieve performance parity with the baseline. This also reflects the excellence of the baseline version, which was first introduced in vLLM 2024.1 by DeepSeek and several other open‑source developers. Much respect to them.
The following is this PR's profiling compared to the baseline. E2E tests passed with correct results under different Split-K values.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist