FastDeploy v2.1.0 upgrades the KVCache scheduling mechanism, strengthens high-concurrency serving, and adds richer sampling strategies, further improving usability and service stability. Inference performance is boosted through CUDA Graph, MTP, and several other optimizations. This release also adds inference support for the open-source ERNIE models on several domestic (Chinese) hardware platforms.
## Usability Improvements

- **Upgraded KVCache scheduling**: input and output KVCache are now managed as a unified pool, fixing the OOM errors previously caused by a misconfigured `kv_cache_ratio`, as well as premature end of generation in multimodal models when the output KVCache ran short. Enable it at deployment time with `export ENABLE_V1_KVCACHE_SCHEDULER=1` (it will be on by default in the next release); once enabled, `kv_cache_ratio` no longer needs to be set. Recommended.
- **High-concurrency hardening**: new `max_concurrency`/`max_waiting_time` settings limit concurrency and reject requests that exceed the waiting timeout, improving user experience and safeguarding service stability.
- **More sampling strategies**: added `min_p` and `top_k_top_p` sampling (see the sampling documentation), plus early stopping based on a repetition strategy or a stop-word list (see the early-stopping documentation).
- **Richer serving responses**: new `return_token_ids`, `include_stop_str_in_output`, and `logprobs` parameters return more complete inference information.
- **Better default-parameter performance**: fixed the performance drop that occurred when the default `max_num_seqs` did not match the actual concurrency, so `max_num_seqs` no longer needs manual tuning.
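The KVCache scheduler and concurrency controls above are configured at deployment time. A minimal sketch, assuming the standard OpenAI-compatible server entrypoint; the command-line spellings of the concurrency flags are an assumption derived from the parameter names, so check the deployment-parameter docs before relying on them:

```shell
# Enable the unified input/output KVCache scheduler
# (planned to become the default in the next release)
export ENABLE_V1_KVCACHE_SCHEDULER=1

# Hypothetical launch; --max-concurrency / --max-waiting-time mirror the
# max_concurrency / max_waiting_time parameters described above
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --port 8180 \
    --max-concurrency 64 \
    --max-waiting-time 300
```

With the v1 scheduler enabled, `kv_cache_ratio` can simply be omitted.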
## Inference Performance Optimizations

- **Broader CUDA Graph coverage**: now supports multi-GPU inference and can be combined with context caching and Chunked Prefill, improving performance by 17%–91% on the ERNIE 4.5 and Qwen3 model families; see the best-practices documentation for details.
- **Faster MTP speculative decoding**: operator optimizations and reduced CPU scheduling overhead improve overall performance; compared with v2.0.0, MTP speculative decoding is now also supported for ERNIE-4.5-21B-A3B.
- **Operator optimizations**: tuned the W4A8, KVCache INT4, and WINT2 Group GEMM compute kernels; for example, the ERNIE-4.5-300B-A47B WINT2 model runs 25.5% faster.
- **More models validated for PD disaggregation**: improved the FlashAttention backend on the prefill (P) node for better long-context inference, validated on lightweight models such as ERNIE-4.5-21B-A3B.
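Several of the items above pair CUDA Graph with Chunked Prefill. As background, chunked prefill splits a long prompt into fixed-size pieces so prefill work can be interleaved with decode steps of other requests. A toy sketch of the splitting step only (names and sizes are illustrative, not FastDeploy internals):

```python
def chunk_prefill(prompt_ids, chunk_size):
    """Yield successive fixed-size chunks of a prompt for incremental prefill."""
    for start in range(0, len(prompt_ids), chunk_size):
        yield prompt_ids[start:start + chunk_size]

# A 10-token prompt with chunk_size=4 is prefetched in three steps
chunks = list(chunk_prefill(list(range(10)), chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

The scheduler can run a decode step for other requests between chunks, which is why the chunk size interacts with the KVCache block size (see the related hang fix in this release).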
## Domestic Hardware Support

- Added ERNIE-4.5-21B-A3B deployment on Kunlunxin P800; see the Kunlunxin P800 deployment documentation.
- Added deployment of the ERNIE 4.5 text models on Hygon K100-AI; see the Hygon K100-AI deployment documentation.
- Added deployment of the ERNIE 4.5 text models on Enflame S60; see the Enflame S60 deployment documentation.
- Added ERNIE-4.5-300B-A47B and ERNIE-4.5-21B-A3B deployment on Iluvatar TianGai 150, with inference performance optimizations; see the Iluvatar deployment documentation.
ERNIE 4.5 model support on domestic hardware (✅ supported, 🚧 in progress, ⛔ not planned):

| Model | Kunlunxin P800 | Ascend 910B | Hygon K100-AI | Iluvatar TianGai 150 | MetaX XiYun C550 | Enflame S60/L600 |
|---|---|---|---|---|---|---|
| ERNIE4.5-VL-424B-A47B | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔ |
| ERNIE4.5-300B-A47B | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ |
| ERNIE4.5-VL-28B-A3B | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔ |
| ERNIE4.5-21B-A3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
| ERNIE4.5-0.3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
## Documentation and Notes

- **Updated PaddlePaddle dependency**: **FastDeploy v2.1.0 requires PaddlePaddle v3.1.1**; for PaddlePaddle installation, see the official PaddlePaddle installation guide.
- The `metadata` field in serving requests is deprecated (still accepted in v2.1.0 but will be removed in a future release); use `extra_body` instead. See the parameter support documentation.
- FastDeploy multi-hardware installation and build guide
- FastDeploy deployment parameters
- Serving deployment guide
- GPU deployment best practices
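To illustrate the `metadata` → `extra_body` migration: with the official `openai` Python client, non-standard parameters passed via `extra_body` are merged into the top level of the request JSON. A standard-library sketch of the resulting payload (model name and parameter values are illustrative):

```python
import json

# Standard OpenAI-style fields
payload = {
    "model": "ERNIE-4.5-21B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "logprobs": True,
}

# FastDeploy-specific extensions: previously nested under "metadata",
# now supplied via extra_body; the openai client merges them into the
# top level of the request body.
extra_body = {
    "min_p": 0.05,
    "return_token_ids": True,
    "include_stop_str_in_output": True,
}
payload.update(extra_body)

body = json.dumps(payload, sort_keys=True)
print("metadata" in body)  # False: no nested metadata object remains
```

When posting raw HTTP instead of using the client, the extension fields go directly into the JSON body.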
Detailed changes are listed below.

- **New Features**
  - PD disaggregation: the D instance supports W4A8 online/offline quantization
  - PD disaggregation: per-chunk KVCache transfer when Chunked Prefill is enabled
  - Support returning `logprobs`
  - Support OpenTelemetry collection of request processing status
  - New `return_token_ids` parameter to return the input and output token ID lists of a request
  - New `include_stop_str_in_output` parameter to include stop strings in the output
  - New `enable_thinking` parameter to toggle thinking mode for QwQ models
  - New repetition-based early-stopping support
  - New `stop` parameter support
  - New multi-node tensor-parallel deployment support
  - New request concurrency and timeout controls for serving
  - Support `min_p`/`top_k_top_p` sampling
  - Support `bad_words`
  - Improved the OpenAI API server: extra parameters are now passed via `extra_body`; `metadata` is deprecated
- **Performance Optimizations**
  - Optimized W4A8 decode performance under PD-disaggregated EP parallelism
  - Optimized the WINT2 Group-GEMM kernel via weight reordering
  - MTP now supports enabling Chunked Prefill
  - Optimized MTP & speculative decoding inference performance
  - Optimized blockwise FP8 quantization performance with Triton
  - CUDA Graph supports padded batches, greatly reducing memory usage
  - New Custom All-Reduce operator; CUDA Graph now supports TP parallelism
  - CUDA Graph can be enabled together with Chunked Prefill
  - Optimized the GetBlockShapeAndSplitKVBlock operator
  - Attention supports C4 asymmetric quantized inference
  - Adapted the FlashAttention backend for TP parallelism and added FlashAttention V2 support
  - Upgraded the KVCache management mechanism (currently GPU only), enabled via `export ENABLE_V1_KVCACHE_SCHEDULER=1`
  - Chunked Prefill with C16/C8/C4 can be enabled under FlashAttention V3
  - The serving engine can automatically aggregate generation results, improving server-client communication efficiency
- **Multi-Hardware Support**
  - Kunlunxin P800 supports the ERNIE-4.5-21B-A3B WINT4/WINT8 models
  - Hygon K100-AI supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models
  - Enflame S60 supports the ERNIE 4.5 model family
  - Iluvatar supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models, with performance optimizations
- **Bug Fixes**
  - Fixed an incorrect first token from the D instance when MTP is enabled under PD disaggregation
  - Fixed out-of-range token sampling in SFT-tuned ERNIE text models
  - Fixed an OOM when launching XPU on a non-zero card
  - Fixed an XPU performance regression with ENABLE_V1_KVCACHE_SCHEDULER=1
  - Fixed a crash when multimodal models served concurrent requests under Chunked Prefill
  - Fixed garbled output from the Qwen3-8B model
  - Fixed hard-coded RMSNorm values
  - Fixed undefined `qkv_bias` in linear.py
  - Fixed an error when max_tokens=1
  - Fixed the token_processor input log format
  - Fixed a service hang under chunked_prefill when the chunk size is smaller than the block size
  - Fixed data saving in VL scenarios
- **Documentation**
  - Added a Chinese README and MkDocs support
  - Added best-practice deployment docs for each model
  - Added usage docs for sampling and early stopping
  - Updated the CUDA Graph and dynamic-to-static interfaces and docs
  - Updated the supported-models documentation
- **Other**
  - Added bilingual (Chinese and English) documentation
  - Improved error messages when loading quantized model parameters
  - Unified the ModelRunner for multimodal and text-only models
  - Updated the WINT2 Triton operator based on triton_utils
  - Consolidated the redundant Config implementations in the codebase
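Among the new sampling options, `min_p` has a precise definition: tokens whose probability falls below `min_p` times the probability of the most likely token are excluded before sampling. A dependency-free reference sketch of the filtering step (this illustrates the technique only, not FastDeploy's kernel):

```python
import math

def min_p_filter(logits, min_p):
    """Mask logits whose softmax probability is below min_p * max_prob."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)
    return [x if p >= threshold else float("-inf")
            for x, p in zip(logits, probs)]

# Probabilities here are roughly [0.73, 0.27, 0.005]; with min_p=0.1 the
# threshold is ~0.073, so only the rare third token is masked.
filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
print(filtered)  # [2.0, 1.0, -inf]
```

Unlike a fixed `top_k`, the number of surviving tokens adapts to how peaked the distribution is.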
## What's Changed
- add wint2 performance by @ZhangHandi in #2673
- Update gh-pages.yml by @DDDivano in #2680
- add --force-reinstall --no-cache-dir when pip install fastdeploy*.whl by @yuanlehome in #2682
- [Sync] Update to latest code by @Jiang-Jia-Jun in #2679
- [doc] update docs by @kevincheng2 in #2690
- [Bug] fix logger format by @ltd0924 in #2689
- [feat] support fa3 backend for pd disaggregated by @yuanlehome in #2695
- add quick benchmark script by @DDDivano in #2703
- [Doc] modify reasoning_output docs by @LiqinruiG in #2696
- [MTP] Support chunked_prefill in speculative decoding(MTP) by @freeliuzc in #2705
- [RL] update reschedule finish reason by @ltd0924 in #2709
- [feature]add fd whl version info by @gzy19990617 in #2698
- Extract eh_proj Layer from ParallelLMHead for MTP to Avoid Weight Transposition Issue by @Deleter-D in #2707
- Add XPU CI, test=model by @quanxiang-liu in #2701
- [CI] Add validation for MTP and CUDAGraph by @EmmonsCurse in #2710
- add support QWQ enable_thinking by @lizexu123 in #2706
- [BugFix] fix paddle_git_commit_id error by @EmmonsCurse in #2714
- spec token map lazy. by @wtmlon in #2715
- fix bug. by @wtmlon in #2718
- Update XPU CI, test=model by @quanxiang-liu in #2721
- [LLM] support multi node deploy by @ltd0924 in #2708
- [Doc]Update eb45-0.3B minimum memory requirement by @ckl117 in #2686
- [RL] Check if the controller port is available by @lddfym in #2724
- remove redundant install whl of fastdeploy by @yuanlehome in #2726
- support FastDeploy version setting by @XieYunshen in #2725
- [iluvatar_gpu] Adapt for iluvatar gpu by @liddk in #2684
- [Optimize] Optimize tensorwise fp8 performance by @ming1753 in #2729
- [Bug fix] fix complie bug when sm < 89 by @ming1753 in #2738
- [SOT] Remove BreakGraph with `paddle.maximum` by @DrRyanHuang in #2731
- 【Fearture】support qwen2 some func by @gzy19990617 in #2740
- [GCU] Support gcu platform by @EnflameGCU in #2702
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2737
- [Bug fix] Add the missing `pod_ip` param to the launch_cache_manager function. by @Wanglongzhi2001 in #2742
- [Bug fix] fix attention rank init by @RichardWooSJTU in #2743
- add precision check for ci by @xiegetest in #2732
- [SOT] Make custom_op dy&st unified by @DrRyanHuang in #2733
- Revert "[Bug fix] fix attention rank init" by @RichardWooSJTU in #2761
- [GCU] Support for non-CUDA builds by @EnflameGCU in #2750
- 【Feature】add fd commit/branch info when start server by @gzy19990617 in #2752
- [Feature] Add speculative decoding simulation benchmark. by @Deleter-D in #2751
- [SOT] Enable SOT Dy2St in Multimodal Model by @DrRyanHuang in #2735
- [Bug fix] Fix ep_moe_expert_dispatch_fp8 in EP mode by @RichardWooSJTU in #2762
- [XPU] Fix the issue of garbled output for offline inference demo by @yulangz in #2763
- Feat/blackwell sm100 support by @celsowm in #2670
- [XPU] Supports BF16 for ERNIE-4.5-21B-A3B and ERNIE-4.5-0.3B by @yulangz in #2765
- [Feature] support custom all-reduce by @zhink in #2758
- Clear dead code And supplementary notes by @gongshaotian in #2757
- [DCU] dcu adapter ernie45t by @lifulll in #2756
- [Executor] Fix bug of logger.debug by @gongshaotian in #2778
- [Feature] block_wise_fp8 support triton_moe_backend by @ckl117 in #2767
- [Doc] modify offline inference docs by @LiqinruiG in #2747
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2783
- [BugFix] Fix vocab size error for ernie model by @Jiang-Jia-Jun in #2785
- [Doc] modify offline_inference docs by @LiqinruiG in #2787
- [SOT] Add env variable `FLAGS_parameters_persistent_mode_in_dy2st` for dy2st by @0x3878f in #2779
- assert prompt len > 0 by @yuanlehome in #2773
- [SOT] Remove breakgraph in post processing && fix datatype by @DrRyanHuang in #2780
- [Feature]support top_k_top_p sampling by @Sunny-bot1 in #2753
- Rename top_p_sampling to top_k_top_p_sampling by @Sunny-bot1 in #2791
- [BugFix] Fix low prediction accuracy of deepseekv3 by @K11OntheBoat in #2798
- [Feature] Online Chat API Support Return logprobs by @ckl117 in #2777
- [XPU] Supports TP4 deployment on 4,5,6,7 by @yulangz in #2794
- [Doc] modify offline-inerence docs by @LiqinruiG in #2800
- [BugFix] fix RMSNorm rms_norm_esp by @lizexu123 in #2797
- [Executor] Move forward_meta.py to fastdeploy/model_executor by @littledgg in #2774
- [Fix]fix top_k_top_p sampling by @Sunny-bot1 in #2801
- Delete Useless Files by @iosmers in #2772
- [XPU] Update docker file by @yulangz in #2809
- Global scheduler supports configuring hot updates by @lddfym in #2807
- [BugFix] Fix rl config enable_logprob by @ckl117 in #2811
- [Feature] Support tensor-parallel-size>num_key_value_heads for qwen3 by @zhink in #2799
- [FIX]fix topp default value by @Sunny-bot1 in #2814
- [Feature]Support Qwen3-moe name_mapping by @gzy19990617 in #2820
- [Bug fix]fix num_blocks_local when small size model in TP2 running mode by @gzy19990617 in #2792
- Feature/logprob bug fix by @zhenwenDang in #2817
- [CI] add result save for ci by @xiegegege in #2824
- [vl]remove duplicated load logic by @bukejiyu in #2744
- [Bug Fix] fix spelling error by @lddfym in #2827
- [features][MTP]Support expert-parellel mode with MTP by @freeliuzc in #2835
- [Feature] Add DeepGEMM pre-compile tools. by @Deleter-D in #2819
- 【Update Docs】update supported_models doc by @chang-wenbin in #2836
- [Fix]fix 'force-reinstall all-depe-packages in build' by @freeliuzc in #2837
- Simplify the Config code by @YuanRisheng in #2770
- [Perf][MTP] Improve speculative decoding(MTP) efficiency by @freeliuzc in #2840
- [vl] Use top_k from config.json. by @bukejiyu in #2831
- 【Inference Optimize】Support wint2 triton kernel about triton_utils_v2 by @chang-wenbin in #2842
- Adapt to the vLLM framework by @ophilia-lee in #2851
- [Docs] add enable_logprob parameter description by @zhenwenDang in #2850
- Merge vl execution path into normal execution path by @zeroRains in #2829
- refactor rl get_name_mappings_to_training by @yuanlehome in #2847
- [Executor] CUDA Graph support padding batch by @gongshaotian in #2844
- [BugFix]Fix Configs by @YuanRisheng in #2849
- [XPU] Update doc and add scripts for downloading dependencies by @yulangz in #2845
- [Fix] Fix expert parallel config bug by @freeliuzc in #2848
- [BugFxi]Fix Config by @YuanRisheng in #2858
- [Fix] Fix mm ep weight init. by @xiaoxiaohehe001 in #2855
- [Fix] Fix FLAGS_max_partition_size by @yangjianfengo1 in #2854
- rl update by @yuanlehome in #2861
- [Trace] add opentelemetry by @sg263 in #2852
- [Attention] remove cum_offsets from atten, and use cu_seqlens_q by @zhoutianzi666 in #2870
- fix and refine vl by @yuanlehome in #2866
- [LLM] support send batch data and aggregate data by @ltd0924 in #2860
- [LLM] Update Multinode Deployment by @ltd0924 in #2830
- [Trace] fix annotation when add opentelemetry by @sg263 in #2869
- [Feature] Add speculative decoding metrics. by @Deleter-D in #2857
- [XPU][doc] Update minimal fastdeploy required by @yulangz in #2863
- [Feature] support prompt repetition_penalty by @ming1753 in #2806
- Fix rollout_model init by @yuanlehome in #2881
- [MM_PROCESS] add _extract_labels by @LokeZhou in #2879
- [LLM] Add parameter validation and exception throwing by @ltd0924 in #2878
- Enable CI workflow for pull requests targeting release/* branches by @XieYunshen in #2887
- [Fix] remove misleading variables by @YuCosine in #2841
- [Bug Fix] fix bug of prompt penalty by @ming1753 in #2888
- [Feature][MTP] Support cacheKV transfer in per_chunk mode by @freeliuzc in #2890
- [Inference, rename] remove padding_offsets from atten use batch_id_per_token by @zhoutianzi666 in #2880
- [XPU][doc] fix typo by @yulangz in #2892
- [LLM] delete fixed slots by @ltd0924 in #2893
- [Executor] Updated the documents related to cuda graph and static graph. by @gongshaotian in #2898
- [Trace]fix opentelemetry can not work in uvicorn by @sg263 in #2906
- [BugFix]Fix sample rejection by @YuanRisheng in #2908
- [XPU] Remove padding_offsets from get_padding_offset.cu by @zhoutianzi666 in #2911
- [Executor] Fix set capture sizes bug by @gongshaotian in #2902
- Enable ci release by @XieYunshen in #2896
- remove cum_offsets from get_block_shape_and_split_kv_block by @zhoutianzi666 in #2913
- 【Feature】support vl model name_mapping and ori_vocab_size by @gzy19990617 in #2900
- [Feature] Support include_stop_str_in_output in chat/completion by @Jiang-Jia-Jun in #2910
- [Feature] Support 45tVL EP FP8 Infer. by @xiaoxiaohehe001 in #2909
- [Bug Fix] fix ep config bug by @ming1753 in #2920
- Update CI cases by @ZhangYulongg in #2916
- Polish code with new pre-commit rules by @zeroRains in #2923
- remove cum_offsets from ForwardMeta by @zhoutianzi666 in #2925
- [Iluvatar GPU] Add CI scripts by @liddk in #2876
- [LLM] delete unused code by @ltd0924 in #2931
- Rename test dir by @YuanRisheng in #2934
- [Feature]support trainer_degree in name_mapping by @gzy19990617 in #2935
- [Feature] support min_p_sampling by @lizexu123 in #2872
- use dist.all_reduce(min) to sync num_blocks_local by @yuanlehome in #2933
- [Executor] Avoid OOM when start the service while Enable Chunked Prefill + CudaGraph by @littledgg in #2936
- [Feature] Add return_token_ids, prompt_token_ids, and delete training, raw_request in request body by @liyonghua0910 in #2940
- remove some code in ep.py by @zhoutianzi666 in #2947
- custom all reduce support cuda graph by @zhink in #2938
- [Polish] Return error message of raw_request by @Jiang-Jia-Jun in #2946
- [BugFix] Rename attention params of deepseekv3 by @K11OntheBoat in #2939
- [Fix]Fix vl when import fastdeploy by @gzy19990617 in #2944
- 【DCU】update top_p_sampling by @lifulll in #2901
- [Fix] non-streaming api now returns full output ids if return_token_ids is enabled by @liyonghua0910 in #2951
- [Feature] DeepseekV3 use pd_build_static_op by @K11OntheBoat in #2948
- [SOT] Mark dynamic dims by type annotations by @SigureMo in #2771
- [Feature] Support using prefix-caching + cudagraph for inference by @zeroRains in #2924
- [Fix]fix empty prompt_token_ids,update the parser's triggering condit… by @luukunn in #2891
- [Feature] Support multi-step MTP. by @Deleter-D in #2952
- [Feature] Marlin MoE backend supports DeepseekV3 by @K11OntheBoat in #2962
- [Fix] Fix code and register cpp operators by @Deleter-D in #2965
- [Bugfix]fix rl config local rank by @gzy19990617 in #2957
- [FIX]fix rejection sampling when topp=0 using _SAMPLING_EPS by @Sunny-bot1 in #2967
- [SOT] Add sot warmup (NVIDIA GPU Only) by @DrRyanHuang in #2929
- [Fearure] support chunk_prefill in fa3 by @lizhenyun01 in #2975
- 【Infer】Improve the performance block_wise_fp8 of triton_moe_backend by @ckl117 in #2942
- [Code Simplification]delete max-len by @lizexu123 in #2959
- [CI] add codestyle_check action by @EmmonsCurse in #2972
- [Fix][MTP] fix mtp bug in pd-split mode by @freeliuzc in #2970
- [BugFix] Add prefill restrictions for chunked_prefill+VL by @zeroRains in #2983
- Fix performance degradation bug of custom_all_reduce by @zhink in #2981
- [Bug Fix] Fix hidden_size in FA by @ckl117 in #2987
- polish code for prefill restrictions by @zeroRains in #2991
- [Feature] Support block scheduler v1 for FD by @rainyfly in #2928
- eblp reload export by @bukejiyu in #2978
- [Code Simplification] fix init_distributed_environment() by @lizexu123 in #2982
- [Feature] support c4 attn && fix cache by @lizhenyun01 in #2996
- [benchmark] add quantization for benchmark yaml by @xiegegege in #2995
- [BugFix] fix mm ep empty run. by @xiaoxiaohehe001 in #2999
- add ci reuse action by @XieYunshen in #2968
- [Feature] multi-source download by @Yzc216 in #2986
- [LLM] update function name by @ltd0924 in #2985
- [BugFix] fix multinode deployment by @ltd0924 in #2977
- Update benchmark tools by @ZhangYulongg in #3004
- update flake8 version to support pre-commit in python3.12 by @zeroRains in #3000
- [Feature] multi source download by @Yzc216 in #3005
- [GCU] Update to develop by @EnflameGCU in #2988
- [Model] Provide clearer error for missing KV cache quantization scales by @littledgg in #3007
- [Feature] Support_eplb by @xiaoxiaohehe001 in #2997
- [GCU] Add CI by @EnflameGCU in #3006
- [GCU] Update post_process by @EnflameGCU in #3012
- [CI] fix codestyle_check by @EmmonsCurse in #3015
- [feature] Support FA2 by @ckl117 in #3009
- [Feat] support mixed ep by @Wanglongzhi2001 in #2969
- [feat] add disable_chat_template in chat api as a substitute for previous raw_request by @liyonghua0910 in #3020
- modified dockerfile by @XieYunshen in #3026
- Add unit test run and coverage report generation by @XieYunshen in #3011
- Optimize the performance of moe_expert_ffn_wint2 by @Xreki in #2990
- Unify server-side and model-side Config (Part1) by @YuanRisheng in #3018
- [Bug fix] Fix arguement error in PD + EP by @Wanglongzhi2001 in #3030
- [Perf] Remove unnecessary operations in non-cuda_graph by @begin2023 in #3010
- 【Bug Fix】MTP rejection_topp add topk input by @ckl117 in #3031
- [BugFix] fix c4 prompt_cache by @lizhenyun01 in #3033
- fix(ci): correct diff coverage data download URL by @XieYunshen in #3036
- [Test] Add error information by @iosmers in #3040
- Unify server-side and model-side Config (Part2) by @YuanRisheng in #3035
- 【Inference Optimize】Update wint2 weight n-dim reorder by @chang-wenbin in #3042
- [Feature] DeepseekV3 supports cuda graph by @K11OntheBoat in #3041
- add logprob ci test by @XieYunshen in #3022
- [fix] w4a8 model loading and hadamard config by @rsmallblue in #3013
- [BugFix] remove hadamard's sync by @lizhenyun01 in #3048
- optimize w4a8 decoding by @rsmallblue in #3050
- [XPU] Support kvblock centralized management by @iosmers in #3017
- Fix Speculative Config bug by @YuanRisheng in #3049
- [stop sequence] support stop sequence by @zoooo0820 in #3025
- [Bug fix] Fix ep when paddle version mismatch by @Wanglongzhi2001 in #3056
- Unify server-side and model-side Config (Part3) by @YuanRisheng in #3047
- [Bug fix] Fix arguement error in mixed_ep when pd by @Wanglongzhi2001 in #3060
- support model loading for w4a8 offline quant by @rsmallblue in #3064
- [Feature] Support repetition early stop by @zeroRains in #3024
- [SOT] Extend SOT warmup support to new hardware by @DrRyanHuang in #3032
- Fix the error when max_tokens=1 is passed by @AuferGachet in #3068
- [Docs]add sampling docs by @Sunny-bot1 in #2973
- [Feature]support bad_words by @Sunny-bot1 in #3055
- update doc: load_balance.md by @lddfym in #3008
- Unify server-side and model-side Config(Part-4) by @YuanRisheng in #3070
- [Doc] add repetition early stopping doc by @zeroRains in #3078
- [Feature] multi source download by @Yzc216 in #3072
- support W4A8 EPLB by @rsmallblue in #3075
- Add uinttest for moe_ffn_wint2. by @Xreki in #3037
- delete unused unittest by @YuanRisheng in #3065
- [BugFix] vl encoder tokens dtype problem by @ltd0924 in #3069
- [Fix] Fix version function by @Jiang-Jia-Jun in #3076
- [Feature] Multimodal Scheduler V1 by @ming1753 in #3019
- w4a8 offline by @bukejiyu in #3074
- adapter qwen3 moe attr for init by @zhink in #3066
- Add ci for custom op approve by @YuanRisheng in #3079
- Revert "Add uinttest for moe_ffn_wint2." by @chang-wenbin in #3085
- [feat]support qwen3 using new loader by @bukejiyu in #3057
- [feat] extra parameters are all passed directly via http payload now, or in extra_body if using openai client by @liyonghua0910 in #3058
- fix the memory leak when modify qp to rts failed by @huzhida in #3051
- logprob debug use by @XieYunshen in #3095
- [doc] add stop_seqs doc by @zoooo0820 in #3090
- [Feature] support ep in mixed mode by @ltd0924 in #3001
- [BugFix]Fix ep size by @YuanRisheng in #3092
- [Feature] Support include_stop_str_in_output in completion api by @Jiang-Jia-Jun in #3096
- [BUG FIX] Fix bug when preempted request rescheduled by @rainyfly in #3080
- [Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel by @gongshaotian in #2989
- add approve ci by @XieYunshen in #3093
- [Doc] Release fastdeploy-xpu 2.0.3 by @iosmers in #3105
- [doc] best practice for eb45 text models by @zoooo0820 in #3002
- fix ci by @XieYunshen in #3106
- [Bug Fix] Fix bug for offline inference in scheduler v1 by @rainyfly in #3117
- [Feature] block scheduler v1 support prefix caching by @kevincheng2 in #3061
- [Doc] add chat_template_kwagrs and update params docs by @LiqinruiG in #3103
- fix is_permuted by @rsmallblue in #3098
- Fix test_EB_Lite_serving.py by @ZhangYulongg in #3119
- [BUG] Fix bug for pd in fd by @rainyfly in #3034
- [Feature] General support for logprobs by @sunlei1024 in #2974
- [BugFix] fix request_output sampling_params in PD by @ckl117 in #3154
- [Executor]Fix get_block_shape_and_split_kv_block Kernel typo by @gongshaotian in #3153
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3173
- [cherry-pick]fix load_pre_sharded_checkpoint (#3152) by @bukejiyu in #3169
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3131
- [XPU] Update XPU dockerflie by @plusNew001 in #3147
- Apply CI fix from Develop by @XieYunshen in #3151
- [Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3178
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit test by @sunlei1024 in #3201
- [BugFix] support real batch_size by @lizexu123 in #3217
- [FIX 2.1]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3199
- [Trace]merge develop trace FD_START by @sg263 in #3253
- [Cherry-pick] fix stop seq by @zoooo0820 in #3263
- [BugFix] fix too many open files problem by @ltd0924 in #3275
- [XPU] Revert PR 3217 by @iosmers in #3286
- [BugFix] fix ep by @lizexu123 in #3290
- fix scheduler bug in release2.1 by @rainyfly in #3295
- Revert "[BugFix] fix ep " by @Jiang-Jia-Jun in #3317
- [Bug Fix] fix uvicorn multi worker error by @kevincheng2 in #3309
- fix ci pypi index error by @XieYunshen in #3327
- [Docs]fix sampling docs 2.1 by @Sunny-bot1 in #3333
- [Bug Fix] fix vl V1 schedule bug by @ming1753 in #3284
- Fix block num in schduelr v1 for release 2.1 by @rainyfly in #3315
- [Bug fix] fix bug for scheduler v0 by @rainyfly in #3306
- Remove useless code release/2.1 by @Jiang-Jia-Jun in #3338
- [BugFix]fix mapping by @gzy19990617 in #3322
- Release/2.1 by @memoryCoderC in #3361
- Completion add raw_prediction/text_after_process by @memoryCoderC in #3362
- Use latest PaddlePaddle package (#3347) by @XieYunshen in #3352
- Pre ce modified (#3335) by @XieYunshen in #3360
- [Bug Fix] Fix V1 video bug by @ming1753 in #3387
- [Cherry-pick] fix stopseq error info by @zoooo0820 in #3342
- [BugFix] Fix default log level of paddleformers by @Jiang-Jia-Jun in #3377
- feat(log):add_request_and_response_log by @xiaolei373 in #3392
- Optimize CI execution workflow. (#3371) by @XieYunshen in #3384
- [BugFix] fix control signal release failed by @ltd0924 in #3374
- [XPU] Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3393
- [BugFix] fix ErnieProcessor not set raw_prediction by @memoryCoderC in #3401
- [Doc]Release fastdeploy-xpu 2.1.0 by @iosmers in #3407
- [Doc]Release fastdeploy-xpu 2.1.0 by @iosmers in #3408
## New Contributors
- @ZhangHandi made their first contribution in #2673
- @LiqinruiG made their first contribution in #2696
- @quanxiang-liu made their first contribution in #2701
- @wtmlon made their first contribution in #2715
- @xiegetest made their first contribution in #2732
- @celsowm made their first contribution in #2670
- @lifulll made their first contribution in #2756
- @0x3878f made their first contribution in #2779
- @K11OntheBoat made their first contribution in #2798
- @xiegegege made their first contribution in #2824
- @ophilia-lee made their first contribution in #2851
- @YuCosine made their first contribution in #2841
- @SigureMo made their first contribution in #2771
- @Xreki made their first contribution in #2990
- @begin2023 made their first contribution in #3010
- @AuferGachet made their first contribution in #3068
- @huzhida made their first contribution in #3051
Full Changelog: v2.0.0...v2.1.0