[Feature] block_wise_fp8 support triton_moe_backend #2767

Merged: 2 commits merged into PaddlePaddle:develop on Jul 9, 2025

Conversation

@ckl117 ckl117 (Collaborator) commented Jul 9, 2025

block_wise_fp8 now supports the triton_moe_backend:

export FD_USE_DEEP_GEMM=0 # use triton_moe_backend for FP8 blockwise MoE.
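
For context, a minimal sketch (not FastDeploy source) of how this switch behaves: the variable is read with a default of "1", so DeepGEMM remains the default path and setting it to 0 opts into the Triton MoE backend. The helper name below is made up for illustration; the lambda it mirrors appears later in this review.

```python
import os

def fd_use_deep_gemm() -> bool:
    # Hypothetical helper mirroring the registration quoted below in this review:
    # lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1")))
    return bool(int(os.getenv("FD_USE_DEEP_GEMM", "1")))

os.environ["FD_USE_DEEP_GEMM"] = "0"   # opt into the Triton blockwise-FP8 MoE backend
print(fd_use_deep_gemm())              # False -> Triton MoE backend is selected
```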

paddle-bot bot commented Jul 9, 2025

Thanks for your contribution!

@Jiang-Jia-Jun Jiang-Jia-Jun requested a review from Copilot July 9, 2025 04:11
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for FP8 blockwise mixture-of-experts (MoE) quantization using a Triton backend, selectable via a new FD_USE_DEEP_GEMM environment variable.

  • Introduces a use_deep_gemm flag in BlockWiseFP8Config and switches between the DeepGemm and Triton MoE methods (sketched after this list)
  • Adds BlockWiseFP8MoEMethod in fused_moe_triton_backend.py with Triton kernels
  • Registers FD_USE_DEEP_GEMM in fastdeploy/envs.py and updates documentation
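
The selection logic described in the bullets above can be summarized with the following hedged sketch. Only BlockWiseFP8MoEMethod, envs.FD_USE_DEEP_GEMM, and the fused_moe_triton_backend file are named in this PR; the DeepGEMM class and module names and the constructor signature are placeholders, and the real code in block_wise_fp8.py may differ.

```python
from fastdeploy import envs  # assumed import path for the env registry in fastdeploy/envs.py


class BlockWiseFP8Config:
    """Sketch only; constructor arguments of the real config are omitted."""

    def __init__(self):
        # Single-assignment form suggested in the review comment below.
        self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)

    def get_quant_method(self, layer):
        if self.use_deep_gemm:
            # Placeholder class/module name for the existing DeepGEMM-based method.
            from fastdeploy.model_executor.layers.moe.fused_moe_deepgemm_backend import (
                DeepGemmFusedMoeMethod)
            return DeepGemmFusedMoeMethod(self)
        else:
            # Triton path added by this PR (file path taken from the table below).
            from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
                BlockWiseFP8MoEMethod)
            return BlockWiseFP8MoEMethod(self)
```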

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Added env var import, flag logic, and conditional backend selection |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Implemented BlockWiseFP8MoEMethod with Triton kernels |
| fastdeploy/envs.py | Registered new FD_USE_DEEP_GEMM environment variable |
| docs/zh/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in Chinese docs |
| docs/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in English docs |
Comments suppressed due to low confidence (3)

fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:65

  • [nitpick] For clarity, indent the return BlockWiseFP8MoEMethod(self) under the else: block to group the import and the return together.
            return BlockWiseFP8MoEMethod(self)

fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py:489

  • The type hint layer: nn.Layer is used but nn (e.g., import paddle.nn as nn) is not imported in this file; this will cause a NameError.
class BlockWiseFP8MoEMethod(QuantMethodBase):

docs/usage/environment_variables.md:73

  • The indentation of this lambda line is inconsistent with the surrounding block; adjust it to align under its key for better readability.
    lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
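
As a hedged illustration of the indentation fix suggested in the last comment above, the registration entry might look like the following once the lambda is aligned under its key; the surrounding dictionary name and the comment are placeholders rather than FastDeploy code.

```python
import os

environment_variables = {
    # Use DeepGEMM for blockwise-FP8 MoE when set to 1 (default);
    # set to 0 to fall back to the Triton MoE backend.
    "FD_USE_DEEP_GEMM":
    lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
}
```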

Comment on lines 41 to 43
self.use_deep_gemm = False
if envs.FD_USE_DEEP_GEMM:
    self.use_deep_gemm = True

Copilot AI Jul 9, 2025


[nitpick] The flag initialization can be simplified to a single assignment: self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM) instead of setting it to False and then conditionally overriding it.

Suggested change
- self.use_deep_gemm = False
- if envs.FD_USE_DEEP_GEMM:
-     self.use_deep_gemm = True
+ self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)


Comment on lines +599 to +603
cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
                       dtype=x.dtype)
intermediate_cache1 = cache13[:token_num * top_k * N1].view(
    [token_num * top_k, N1])
intermediate_cache3 = cache13[:token_num * top_k * N2].view(

Copilot AI Jul 9, 2025


[nitpick] The variable name cache13 is not descriptive; consider renaming it to something like combined_intermediate_buffer or adding a comment to explain its reuse strategy.

Suggested change
- cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
-                        dtype=x.dtype)
- intermediate_cache1 = cache13[:token_num * top_k * N1].view(
-     [token_num * top_k, N1])
- intermediate_cache3 = cache13[:token_num * top_k * N2].view(
+ # Create a shared buffer for intermediate computations.
+ # This buffer is sliced and reshaped into `intermediate_cache1` and `intermediate_cache3`.
+ combined_intermediate_buffer = paddle.empty([token_num * top_k * max(N1, N2)],
+                                             dtype=x.dtype)
+ intermediate_cache1 = combined_intermediate_buffer[:token_num * top_k * N1].view(
+     [token_num * top_k, N1])
+ intermediate_cache3 = combined_intermediate_buffer[:token_num * top_k * N2].view(

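Beyond the rename, the pattern flagged in this comment is a single backing allocation reused by two intermediate views. A standalone sketch of the idea follows, with toy shapes rather than the PR's kernel code; the storage-sharing behavior described in the comments reflects the PR's intent with Paddle's view on a sliced tensor.

```python
import paddle

token_num, top_k, N1, N2 = 4, 2, 128, 64

# One buffer sized for the larger intermediate; both caches are carved out of it.
cache13 = paddle.empty([token_num * top_k * max(N1, N2)], dtype="float16")

# First intermediate (e.g., output of the first grouped GEMM).
intermediate_cache1 = cache13[:token_num * top_k * N1].view([token_num * top_k, N1])

# Second intermediate reuses the same underlying storage, which is only safe
# once intermediate_cache1's contents have been consumed.
intermediate_cache3 = cache13[:token_num * top_k * N2].view([token_num * top_k, N2])

print(intermediate_cache1.shape, intermediate_cache3.shape)  # [8, 128] [8, 64]
```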

@zhoutianzi666 zhoutianzi666 merged commit 888780f into PaddlePaddle:develop Jul 9, 2025
3 checks passed