[Feature] block_wise_fp8 support triton_moe_backend #2767
Conversation
Thanks for your contribution!
Pull Request Overview
This PR adds support for FP8 blockwise mixture-of-experts (MoE) quantization using a Triton backend, selectable via a new FD_USE_DEEP_GEMM environment variable.
- Introduces a use_deep_gemm flag in BlockWiseFP8Config and switches between the DeepGEMM and Triton MoE methods (see the sketch after this list)
- Adds BlockWiseFP8MoEMethod in fused_moe_triton_backend.py with Triton kernels
- Registers FD_USE_DEEP_GEMM in fastdeploy/envs.py and updates the documentation
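As referenced in the first item, here is a minimal sketch of how that switch is expected to look. Apart from use_deep_gemm, FD_USE_DEEP_GEMM, BlockWiseFP8MoEMethod, fused_moe_triton_backend.py, and the quoted return statement, every name below (the method name, the DeepGEMM import path and class) is an assumption, not the PR's actual code.

```python
# Illustrative sketch only -- names marked "assumed" are not taken from the PR.
from fastdeploy import envs  # FD_USE_DEEP_GEMM is registered in fastdeploy/envs.py


class BlockWiseFP8Config:
    def __init__(self):
        # Single-assignment form also suggested by a review comment further down.
        self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)

    def get_moe_quant_method(self):  # assumed method name
        if self.use_deep_gemm:
            # Existing DeepGEMM-based path stays the default.
            from fastdeploy.model_executor.layers.moe.fused_moe_deepgemm_backend import (  # assumed module
                DeepGemmFusedMoeMethod,  # assumed class name
            )
            return DeepGemmFusedMoeMethod(self)
        else:
            # New Triton-based path added by this PR.
            from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
                BlockWiseFP8MoEMethod,
            )
            return BlockWiseFP8MoEMethod(self)
```

Keeping each import local to its branch is also why the first suppressed comment below asks for the return to be indented under the else: block, so the import and the return read as one unit.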
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Added env var import, flag logic, and conditional backend selection |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Implemented BlockWiseFP8MoEMethod with Triton kernels |
| fastdeploy/envs.py | Registered new FD_USE_DEEP_GEMM environment variable |
| docs/zh/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in Chinese docs |
| docs/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in English docs |
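Both documentation pages describe the new variable. As a hedged usage sketch (only the variable name and its default of 1 come from the PR; the startup flow around it is assumed): DeepGEMM stays the default, and setting the variable to 0 before FastDeploy reads its environment selects the Triton block-wise FP8 MoE path.

```python
import os

# Must be set before FastDeploy evaluates fastdeploy/envs.py (assumed startup order).
os.environ["FD_USE_DEEP_GEMM"] = "0"  # 0 -> Triton MoE backend, 1 (default) -> DeepGEMM

import fastdeploy  # noqa: E402 -- model loading then proceeds as usual
```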
Comments suppressed due to low confidence (3)
fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:65
- [nitpick] For clarity, indent the return BlockWiseFP8MoEMethod(self) under the else: block to group the import and the return together.
return BlockWiseFP8MoEMethod(self)
fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py:489
- The type hint layer: nn.Layer is used, but nn (e.g., import paddle.nn as nn) is not imported in this file; this will cause a NameError.
class BlockWiseFP8MoEMethod(QuantMethodBase):
docs/usage/environment_variables.md:73
- The indentation of this lambda line is inconsistent with the surrounding block; adjust it to align under its key for better readability. (A sketch of the registration pattern follows this list.)
lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
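As referenced above, here is a sketch of the registration pattern this lambda presumably sits in. The registry name and its lazy-lambda convention are assumptions about fastdeploy/envs.py; only the FD_USE_DEEP_GEMM entry itself is quoted from the PR.

```python
import os

# Assumed shape of the registry in fastdeploy/envs.py: each entry maps a variable
# name to a lambda so the value is re-read from the process environment on access.
environment_variables = {
    # Use DeepGEMM for the block-wise FP8 MoE path (enabled by default); set
    # FD_USE_DEEP_GEMM=0 to select the Triton backend added in this PR.
    "FD_USE_DEEP_GEMM":
    lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
}
```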
self.use_deep_gemm = False
if envs.FD_USE_DEEP_GEMM:
    self.use_deep_gemm = True
[nitpick] The flag initialization can be simplified to a single assignment, self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM), instead of setting it to False and then conditionally overriding it.
Suggested change:
- self.use_deep_gemm = False
- if envs.FD_USE_DEEP_GEMM:
-     self.use_deep_gemm = True
+ self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)
cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
                       dtype=x.dtype)
intermediate_cache1 = cache13[:token_num * top_k * N1].view(
    [token_num * top_k, N1])
intermediate_cache3 = cache13[:token_num * top_k * N2].view(
[nitpick] The variable name cache13 is not descriptive; consider renaming it to something like combined_intermediate_buffer or adding a comment to explain its reuse strategy. (A standalone illustration of the reuse follows the suggestion below.)
Suggested change:
- cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
-                        dtype=x.dtype)
- intermediate_cache1 = cache13[:token_num * top_k * N1].view(
-     [token_num * top_k, N1])
- intermediate_cache3 = cache13[:token_num * top_k * N2].view(
+ # Create a shared buffer for intermediate computations.
+ # This buffer is sliced and reshaped into `intermediate_cache1` and `intermediate_cache3`.
+ combined_intermediate_buffer = paddle.empty([token_num * top_k * max(N1, N2)],
+                                             dtype=x.dtype)
+ intermediate_cache1 = combined_intermediate_buffer[:token_num * top_k * N1].view(
+     [token_num * top_k, N1])
+ intermediate_cache3 = combined_intermediate_buffer[:token_num * top_k * N2].view(
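To make the reuse strategy behind cache13 concrete, here is a small standalone illustration with toy sizes (not the PR's kernel code; it assumes a Paddle version where Tensor.view is available, as the snippet above already does). One flat allocation backs both caches, which is safe on the assumption, consistent with the buffer sizes, that the first grouped GEMM's output is no longer needed by the time the final projection writes its result.

```python
import paddle

# Toy sizes; in the PR these come from the MoE layer (token count, experts routed
# per token, and the two intermediate widths N1 and N2).
token_num, top_k, N1, N2 = 4, 2, 16, 8

# One flat buffer sized for the larger of the two intermediates.
cache13 = paddle.empty([token_num * top_k * max(N1, N2)], dtype="float32")

# Two differently shaped views over the same storage.
intermediate_cache1 = cache13[:token_num * top_k * N1].view([token_num * top_k, N1])
intermediate_cache3 = cache13[:token_num * top_k * N2].view([token_num * top_k, N2])

print(intermediate_cache1.shape)  # [8, 16]
print(intermediate_cache3.shape)  # [8, 8]
```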