[Feature] block_wise_fp8 support triton_moe_backend #2767
Conversation
Thanks for your contribution!
Pull Request Overview
This PR adds support for FP8 blockwise mixture-of-experts (MoE) quantization using a Triton backend, selectable via a new FD_USE_DEEP_GEMM environment variable.
- Introduces a use_deep_gemm flag in BlockWiseFP8Config and switches between the DeepGEMM and Triton MoE methods (see the sketch after this list)
- Adds BlockWiseFP8MoEMethod in fused_moe_triton_backend.py with Triton kernels
- Registers FD_USE_DEEP_GEMM in fastdeploy/envs.py and updates the documentation
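As referenced in the first item, here is a minimal sketch of how that switch is expected to look. Apart from use_deep_gemm, FD_USE_DEEP_GEMM, BlockWiseFP8MoEMethod, fused_moe_triton_backend.py, and the quoted return statement, every name below (the method name, the DeepGEMM import path and class) is an assumption, not the PR's actual code.

```python
# Illustrative sketch only -- names marked "assumed" are not taken from the PR.
from fastdeploy import envs  # FD_USE_DEEP_GEMM is registered in fastdeploy/envs.py


class BlockWiseFP8Config:
    def __init__(self):
        # Single-assignment form also suggested by a review comment further down.
        self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)

    def get_moe_quant_method(self):  # assumed method name
        if self.use_deep_gemm:
            # Existing DeepGEMM-based path stays the default.
            from fastdeploy.model_executor.layers.moe.fused_moe_deepgemm_backend import (  # assumed module
                DeepGemmFusedMoeMethod,  # assumed class name
            )
            return DeepGemmFusedMoeMethod(self)
        else:
            # New Triton-based path added by this PR.
            from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
                BlockWiseFP8MoEMethod,
            )
            return BlockWiseFP8MoEMethod(self)
```

Keeping each import local to its branch is also why the first suppressed comment below asks for the return to be indented under the else: block, so the import and the return read as one unit.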
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/quantization/block_wise_fp8.py | Added env var import, flag logic, and conditional backend selection |
| fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py | Implemented BlockWiseFP8MoEMethod with Triton kernels |
| fastdeploy/envs.py | Registered new FD_USE_DEEP_GEMM environment variable |
| docs/zh/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in Chinese docs |
| docs/usage/environment_variables.md | Documented FD_USE_DEEP_GEMM in English docs |
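Both documentation pages describe the new variable. As a hedged usage sketch (only the variable name and its default of 1 come from the PR; the startup flow around it is assumed): DeepGEMM stays the default, and setting the variable to 0 before FastDeploy reads its environment selects the Triton block-wise FP8 MoE path.

```python
import os

# Must be set before FastDeploy evaluates fastdeploy/envs.py (assumed startup order).
os.environ["FD_USE_DEEP_GEMM"] = "0"  # 0 -> Triton MoE backend, 1 (default) -> DeepGEMM

import fastdeploy  # noqa: E402 -- model loading then proceeds as usual
```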
Comments suppressed due to low confidence (3)
fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:65
- [nitpick] For clarity, indent the return BlockWiseFP8MoEMethod(self) under the else: block to group the import and the return together.
return BlockWiseFP8MoEMethod(self)
fastdeploy/model_executor/layers/moe/fused_moe_triton_backend.py:489
- The type hint layer: nn.Layer is used, but nn (e.g., import paddle.nn as nn) is not imported in this file; this will cause a NameError.
class BlockWiseFP8MoEMethod(QuantMethodBase):
docs/usage/environment_variables.md:73
- The indentation of this lambda line is inconsistent with the surrounding block; adjust it to align under its key for better readability. (A sketch of the registration pattern follows this list.)
lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
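As referenced above, here is a sketch of the registration pattern this lambda presumably sits in. The registry name and its lazy-lambda convention are assumptions about fastdeploy/envs.py; only the FD_USE_DEEP_GEMM entry itself is quoted from the PR.

```python
import os

# Assumed shape of the registry in fastdeploy/envs.py: each entry maps a variable
# name to a lambda so the value is re-read from the process environment on access.
environment_variables = {
    # Use DeepGEMM for the block-wise FP8 MoE path (enabled by default); set
    # FD_USE_DEEP_GEMM=0 to select the Triton backend added in this PR.
    "FD_USE_DEEP_GEMM":
    lambda: bool(int(os.getenv("FD_USE_DEEP_GEMM", "1"))),
}
```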
self.use_deep_gemm = False
if envs.FD_USE_DEEP_GEMM:
    self.use_deep_gemm = True
[nitpick] The flag initialization can be simplified to a single assignment, self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM), instead of setting it to False and then conditionally overriding it.
Suggested change:
- self.use_deep_gemm = False
- if envs.FD_USE_DEEP_GEMM:
-     self.use_deep_gemm = True
+ self.use_deep_gemm = bool(envs.FD_USE_DEEP_GEMM)
cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
                       dtype=x.dtype)
intermediate_cache1 = cache13[:token_num * top_k * N1].view(
    [token_num * top_k, N1])
intermediate_cache3 = cache13[:token_num * top_k * N2].view(
[nitpick] The variable name cache13 is not descriptive; consider renaming it to something like combined_intermediate_buffer or adding a comment to explain its reuse strategy. (A standalone illustration of the reuse follows the suggestion below.)
Suggested change:
- cache13 = paddle.empty([token_num * top_k * max(N1, N2)],
-                        dtype=x.dtype)
- intermediate_cache1 = cache13[:token_num * top_k * N1].view(
-     [token_num * top_k, N1])
- intermediate_cache3 = cache13[:token_num * top_k * N2].view(
+ # Create a shared buffer for intermediate computations.
+ # This buffer is sliced and reshaped into `intermediate_cache1` and `intermediate_cache3`.
+ combined_intermediate_buffer = paddle.empty([token_num * top_k * max(N1, N2)],
+                                             dtype=x.dtype)
+ intermediate_cache1 = combined_intermediate_buffer[:token_num * top_k * N1].view(
+     [token_num * top_k, N1])
+ intermediate_cache3 = combined_intermediate_buffer[:token_num * top_k * N2].view(
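To make the reuse strategy behind cache13 concrete, here is a small standalone illustration with toy sizes (not the PR's kernel code; it assumes a Paddle version where Tensor.view is available, as the snippet above already does). One flat allocation backs both caches, which is safe on the assumption, consistent with the buffer sizes, that the first grouped GEMM's output is no longer needed by the time the final projection writes its result.

```python
import paddle

# Toy sizes; in the PR these come from the MoE layer (token count, experts routed
# per token, and the two intermediate widths N1 and N2).
token_num, top_k, N1, N2 = 4, 2, 16, 8

# One flat buffer sized for the larger of the two intermediates.
cache13 = paddle.empty([token_num * top_k * max(N1, N2)], dtype="float32")

# Two differently shaped views over the same storage.
intermediate_cache1 = cache13[:token_num * top_k * N1].view([token_num * top_k, N1])
intermediate_cache3 = cache13[:token_num * top_k * N2].view([token_num * top_k, N2])

print(intermediate_cache1.shape)  # [8, 16]
print(intermediate_cache3.shape)  # [8, 8]
```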