FastDeploy v2.1.0 upgrades the KVCache scheduling mechanism, strengthens high-concurrency serving, and adds richer sampling strategies, further improving usability and service stability. Inference performance is boosted through CUDA Graph, MTP, and several other optimizations. This release also adds inference support for the open-source ERNIE models on several domestic (Chinese) hardware platforms.
## Usability Improvements

- **Upgraded KVCache scheduling**: input and output KVCache are now managed as a unified pool, fixing the OOM errors previously caused by a misconfigured `kv_cache_ratio`, as well as premature end of generation in multimodal models when the output KVCache ran short. Enable it at deployment time with `export ENABLE_V1_KVCACHE_SCHEDULER=1` (it will be on by default in the next release); once enabled, `kv_cache_ratio` no longer needs to be set. Recommended.
- **High-concurrency hardening**: new `max_concurrency`/`max_waiting_time` settings limit concurrency and reject requests that exceed the waiting timeout, improving user experience and safeguarding service stability.
- **More sampling strategies**: added `min_p` and `top_k_top_p` sampling (see the sampling documentation), plus early stopping based on a repetition strategy or a stop-word list (see the early-stopping documentation).
- **Richer serving responses**: new `return_token_ids`, `include_stop_str_in_output`, and `logprobs` parameters return more complete inference information.
- **Better default-parameter performance**: fixed the performance drop that occurred when the default `max_num_seqs` did not match the actual concurrency, so `max_num_seqs` no longer needs manual tuning.
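The KVCache scheduler and concurrency controls above are configured at deployment time. A minimal sketch, assuming the standard OpenAI-compatible server entrypoint; the command-line spellings of the concurrency flags are an assumption derived from the parameter names, so check the deployment-parameter docs before relying on them:

```shell
# Enable the unified input/output KVCache scheduler
# (planned to become the default in the next release)
export ENABLE_V1_KVCACHE_SCHEDULER=1

# Hypothetical launch; --max-concurrency / --max-waiting-time mirror the
# max_concurrency / max_waiting_time parameters described above
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-21B-A3B-Paddle \
    --port 8180 \
    --max-concurrency 64 \
    --max-waiting-time 300
```

With the v1 scheduler enabled, `kv_cache_ratio` can simply be omitted.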
## Inference Performance Optimizations

- **Broader CUDA Graph coverage**: now supports multi-GPU inference and can be combined with context caching and Chunked Prefill, improving performance by 17%–91% on the ERNIE 4.5 and Qwen3 model families; see the best-practices documentation for details.
- **Faster MTP speculative decoding**: operator optimizations and reduced CPU scheduling overhead improve overall performance; compared with v2.0.0, MTP speculative decoding is now also supported for ERNIE-4.5-21B-A3B.
- **Operator optimizations**: tuned the W4A8, KVCache INT4, and WINT2 Group GEMM compute kernels; for example, the ERNIE-4.5-300B-A47B WINT2 model runs 25.5% faster.
- **More models validated for PD disaggregation**: improved the FlashAttention backend on the prefill (P) node for better long-context inference, validated on lightweight models such as ERNIE-4.5-21B-A3B.
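Several of the items above pair CUDA Graph with Chunked Prefill. As background, chunked prefill splits a long prompt into fixed-size pieces so prefill work can be interleaved with decode steps of other requests. A toy sketch of the splitting step only (names and sizes are illustrative, not FastDeploy internals):

```python
def chunk_prefill(prompt_ids, chunk_size):
    """Yield successive fixed-size chunks of a prompt for incremental prefill."""
    for start in range(0, len(prompt_ids), chunk_size):
        yield prompt_ids[start:start + chunk_size]

# A 10-token prompt with chunk_size=4 is prefetched in three steps
chunks = list(chunk_prefill(list(range(10)), chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

The scheduler can run a decode step for other requests between chunks, which is why the chunk size interacts with the KVCache block size (see the related hang fix in this release).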
## Domestic Hardware Support

- Added ERNIE-4.5-21B-A3B deployment on Kunlunxin P800; see the Kunlunxin P800 deployment documentation.
- Added deployment of the ERNIE 4.5 text models on Hygon K100-AI; see the Hygon K100-AI deployment documentation.
- Added deployment of the ERNIE 4.5 text models on Enflame S60; see the Enflame S60 deployment documentation.
- Added ERNIE-4.5-300B-A47B and ERNIE-4.5-21B-A3B deployment on Iluvatar TianGai 150, with inference performance optimizations; see the Iluvatar deployment documentation.
ERNIE 4.5 model support on domestic hardware (✅ supported, 🚧 in progress, ⛔ not planned):

| Model | Kunlunxin P800 | Ascend 910B | Hygon K100-AI | Iluvatar TianGai 150 | MetaX XiYun C550 | Enflame S60/L600 |
|---|---|---|---|---|---|---|
| ERNIE4.5-VL-424B-A47B | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔ |
| ERNIE4.5-300B-A47B | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ |
| ERNIE4.5-VL-28B-A3B | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔ |
| ERNIE4.5-21B-A3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
| ERNIE4.5-0.3B | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
## Documentation and Notes

- **Updated PaddlePaddle dependency**: **FastDeploy v2.1.0 requires PaddlePaddle v3.1.1**; for PaddlePaddle installation, see the official PaddlePaddle installation guide.
- The `metadata` field in serving requests is deprecated (still accepted in v2.1.0 but will be removed in a future release); use `extra_body` instead. See the parameter support documentation.
- FastDeploy multi-hardware installation and build guide
- FastDeploy deployment parameters
- Serving deployment guide
- GPU deployment best practices
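To illustrate the `metadata` → `extra_body` migration: with the official `openai` Python client, non-standard parameters passed via `extra_body` are merged into the top level of the request JSON. A standard-library sketch of the resulting payload (model name and parameter values are illustrative):

```python
import json

# Standard OpenAI-style fields
payload = {
    "model": "ERNIE-4.5-21B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "logprobs": True,
}

# FastDeploy-specific extensions: previously nested under "metadata",
# now supplied via extra_body; the openai client merges them into the
# top level of the request body.
extra_body = {
    "min_p": 0.05,
    "return_token_ids": True,
    "include_stop_str_in_output": True,
}
payload.update(extra_body)

body = json.dumps(payload, sort_keys=True)
print("metadata" in body)  # False: no nested metadata object remains
```

When posting raw HTTP instead of using the client, the extension fields go directly into the JSON body.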
Detailed changes are listed below.

- **New Features**
  - PD disaggregation: the D instance supports W4A8 online/offline quantization
  - PD disaggregation: per-chunk KVCache transfer when Chunked Prefill is enabled
  - Support returning `logprobs`
  - Support OpenTelemetry collection of request processing status
  - New `return_token_ids` parameter to return the input and output token ID lists of a request
  - New `include_stop_str_in_output` parameter to include stop strings in the output
  - New `enable_thinking` parameter to toggle thinking mode for QwQ models
  - New repetition-based early-stopping support
  - New `stop` parameter support
  - New multi-node tensor-parallel deployment support
  - New request concurrency and timeout controls for serving
  - Support `min_p`/`top_k_top_p` sampling
  - Support `bad_words`
  - Improved the OpenAI API server: extra parameters are now passed via `extra_body`; `metadata` is deprecated
- **Performance Optimizations**
  - Optimized W4A8 decode performance under PD-disaggregated EP parallelism
  - Optimized the WINT2 Group-GEMM kernel via weight reordering
  - MTP now supports enabling Chunked Prefill
  - Optimized MTP & speculative decoding inference performance
  - Optimized blockwise FP8 quantization performance with Triton
  - CUDA Graph supports padded batches, greatly reducing memory usage
  - New Custom All-Reduce operator; CUDA Graph now supports TP parallelism
  - CUDA Graph can be enabled together with Chunked Prefill
  - Optimized the GetBlockShapeAndSplitKVBlock operator
  - Attention supports C4 asymmetric quantized inference
  - Adapted the FlashAttention backend for TP parallelism and added FlashAttention V2 support
  - Upgraded the KVCache management mechanism (currently GPU only), enabled via `export ENABLE_V1_KVCACHE_SCHEDULER=1`
  - Chunked Prefill with C16/C8/C4 can be enabled under FlashAttention V3
  - The serving engine can automatically aggregate generation results, improving server-client communication efficiency
- **Multi-Hardware Support**
  - Kunlunxin P800 supports the ERNIE-4.5-21B-A3B WINT4/WINT8 models
  - Hygon K100-AI supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models
  - Enflame S60 supports the ERNIE 4.5 model family
  - Iluvatar supports the ERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3B models, with performance optimizations
- **Bug Fixes**
  - Fixed an incorrect first token from the D instance when MTP is enabled under PD disaggregation
  - Fixed out-of-range token sampling in SFT-tuned ERNIE text models
  - Fixed an OOM when launching XPU on a non-zero card
  - Fixed an XPU performance regression with ENABLE_V1_KVCACHE_SCHEDULER=1
  - Fixed a crash when multimodal models served concurrent requests under Chunked Prefill
  - Fixed garbled output from the Qwen3-8B model
  - Fixed hard-coded RMSNorm values
  - Fixed undefined `qkv_bias` in linear.py
  - Fixed an error when max_tokens=1
  - Fixed the token_processor input log format
  - Fixed a service hang under chunked_prefill when the chunk size is smaller than the block size
  - Fixed data saving in VL scenarios
- **Documentation**
  - Added a Chinese README and MkDocs support
  - Added best-practice deployment docs for each model
  - Added usage docs for sampling and early stopping
  - Updated the CUDA Graph and dynamic-to-static interfaces and docs
  - Updated the supported-models documentation
- **Other**
  - Added bilingual (Chinese and English) documentation
  - Improved error messages when loading quantized model parameters
  - Unified the ModelRunner for multimodal and text-only models
  - Updated the WINT2 Triton operator based on triton_utils
  - Consolidated the redundant Config implementations in the codebase
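Among the new sampling options, `min_p` has a precise definition: tokens whose probability falls below `min_p` times the probability of the most likely token are excluded before sampling. A dependency-free reference sketch of the filtering step (this illustrates the technique only, not FastDeploy's kernel):

```python
import math

def min_p_filter(logits, min_p):
    """Mask logits whose softmax probability is below min_p * max_prob."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)
    return [x if p >= threshold else float("-inf")
            for x, p in zip(logits, probs)]

# Probabilities here are roughly [0.73, 0.27, 0.005]; with min_p=0.1 the
# threshold is ~0.073, so only the rare third token is masked.
filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
print(filtered)  # [2.0, 1.0, -inf]
```

Unlike a fixed `top_k`, the number of surviving tokens adapts to how peaked the distribution is.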
## What's Changed
- add wint2 performance by @ZhangHandi in #2673
- Update gh-pages.yml by @DDDivano in #2680
- add --force-reinstall --no-cache-dir when pip install fastdeploy*.whl by @yuanlehome in #2682
- [Sync] Update to latest code by @Jiang-Jia-Jun in #2679
- [doc] update docs by @kevincheng2 in #2690
- [Bug] fix logger format by @ltd0924 in #2689
- [feat] support fa3 backend for pd disaggregated by @yuanlehome in #2695
- add quick benchmark script by @DDDivano in #2703
- [Doc] modify reasoning_output docs by @LiqinruiG in #2696
- [MTP] Support chunked_prefill in speculative decoding(MTP) by @freeliuzc in #2705
- [RL] update reschedule finish reason by @ltd0924 in #2709
- [feature]add fd whl version info by @gzy19990617 in #2698
- Extract eh_proj Layer from ParallelLMHead for MTP to Avoid Weight Transposition Issue by @Deleter-D in #2707
- Add XPU CI, test=model by @quanxiang-liu in #2701
- [CI] Add validation for MTP and CUDAGraph by @EmmonsCurse in #2710
- add support QWQ enable_thinking by @lizexu123 in #2706
- [BugFix] fix paddle_git_commit_id error by @EmmonsCurse in #2714
- spec token map lazy. by @wtmlon in #2715
- fix bug. by @wtmlon in #2718
- Update XPU CI, test=model by @quanxiang-liu in #2721
- [LLM] support multi node deploy by @ltd0924 in #2708
- [Doc]Update eb45-0.3B minimum memory requirement by @ckl117 in #2686
- [RL] Check if the controller port is available by @lddfym in #2724
- remove redundant install whl of fastdeploy by @yuanlehome in #2726
- support FastDeploy version setting by @XieYunshen in #2725
- [iluvatar_gpu] Adapt for iluvatar gpu by @liddk in #2684
- [Optimize] Optimize tensorwise fp8 performance by @ming1753 in #2729
- [Bug fix] fix complie bug when sm < 89 by @ming1753 in #2738
- [SOT] Remove BreakGraph with `paddle.maximum` by @DrRyanHuang in #2731
- 【Fearture】support qwen2 some func by @gzy19990617 in #2740
- [GCU] Support gcu platform by @EnflameGCU in #2702
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2737
- [Bug fix] Add the missing `pod_ip` param to the launch_cache_manager function. by @Wanglongzhi2001 in #2742
- [Bug fix] fix attention rank init by @RichardWooSJTU in #2743
- add precision check for ci by @xiegetest in #2732
- [SOT] Make custom_op dy&st unified by @DrRyanHuang in #2733
- Revert "[Bug fix] fix attention rank init" by @RichardWooSJTU in #2761
- [GCU] Support for non-CUDA builds by @EnflameGCU in #2750
- 【Feature】add fd commit/branch info when start server by @gzy19990617 in #2752
- [Feature] Add speculative decoding simulation benchmark. by @Deleter-D in #2751
- [SOT] Enable SOT Dy2St in Multimodal Model by @DrRyanHuang in #2735
- [Bug fix] Fix ep_moe_expert_dispatch_fp8 in EP mode by @RichardWooSJTU in #2762
- [XPU] Fix the issue of garbled output for offline inference demo by @yulangz in #2763
- Feat/blackwell sm100 support by @celsowm in #2670
- [XPU] Supports BF16 for ERNIE-4.5-21B-A3B and ERNIE-4.5-0.3B by @yulangz in #2765
- [Feature] support custom all-reduce by @zhink in #2758
- Clear dead code And supplementary notes by @gongshaotian in #2757
- [DCU] dcu adapter ernie45t by @lifulll in #2756
- [Executor] Fix bug of logger.debug by @gongshaotian in #2778
- [Feature] block_wise_fp8 support triton_moe_backend by @ckl117 in #2767
- [Doc] modify offline inference docs by @LiqinruiG in #2747
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2783
- [BugFix] Fix vocab size error for ernie model by @Jiang-Jia-Jun in #2785
- [Doc] modify offline_inference docs by @LiqinruiG in #2787
- [SOT] Add env variable `FLAGS_parameters_persistent_mode_in_dy2st` for dy2st by @0x3878f in #2779
- assert prompt len > 0 by @yuanlehome in #2773
- [SOT] Remove breakgraph in post processing && fix datatype by @DrRyanHuang in #2780
- [Feature]support top_k_top_p sampling by @Sunny-bot1 in #2753
- Rename top_p_sampling to top_k_top_p_sampling by @Sunny-bot1 in #2791
- [BugFix] Fix low prediction accuracy of deepseekv3 by @K11OntheBoat in #2798
- [Feature] Online Chat API Support Return logprobs by @ckl117 in #2777
- [XPU] Supports TP4 deployment on 4,5,6,7 by @yulangz in #2794
- [Doc] modify offline-inerence docs by @LiqinruiG in #2800
- [BugFix] fix RMSNorm rms_norm_esp by @lizexu123 in #2797
- [Executor] Move forward_meta.py to fastdeploy/model_executor by @littledgg in #2774
- [Fix]fix top_k_top_p sampling by @Sunny-bot1 in #2801
- Delete Useless Files by @iosmers in #2772
- [XPU] Update docker file by @yulangz in #2809
- Global scheduler supports configuring hot updates by @lddfym in #2807
- [BugFix] Fix rl config enable_logprob by @ckl117 in #2811
- [Feature] Support tensor-parallel-size>num_key_value_heads for qwen3 by @zhink in #2799
- [FIX]fix topp default value by @Sunny-bot1 in #2814
- [Feature]Support Qwen3-moe name_mapping by @gzy19990617 in #2820
- [Bug fix]fix num_blocks_local when small size model in TP2 running mode by @gzy19990617 in #2792
- Feature/logprob bug fix by @zhenwenDang in #2817
- [CI] add result save for ci by @xiegegege in #2824
- [vl]remove duplicated load logic by @bukejiyu in #2744
- [Bug Fix] fix spelling error by @lddfym in #2827
- [features][MTP]Support expert-parellel mode with MTP by @freeliuzc in #2835
- [Feature] Add DeepGEMM pre-compile tools. by @Deleter-D in #2819
- 【Update Docs】update supported_models doc by @chang-wenbin in #2836
- [Fix]fix 'force-reinstall all-depe-packages in build' by @freeliuzc in #2837
- Simplify the Config code by @YuanRisheng in #2770
- [Perf][MTP] Improve speculative decoding(MTP) efficiency by @freeliuzc in #2840
- [vl] Use top_k from config.json. by @bukejiyu in #2831
- 【Inference Optimize】Support wint2 triton kernel about triton_utils_v2 by @chang-wenbin in #2842
- Adapt to the vLLM framework by @ophilia-lee in #2851
- [Docs] add enable_logprob parameter description by @zhenwenDang in #2850
- Merge vl execution path into normal execution path by @zeroRains in #2829
- refactor rl get_name_mappings_to_training by @yuanlehome in #2847
- [Executor] CUDA Graph support padding batch by @gongshaotian in #2844
- [BugFix]Fix Configs by @YuanRisheng in #2849
- [XPU] Update doc and add scripts for downloading dependencies by @yulangz in #2845
- [Fix] Fix expert parallel config bug by @freeliuzc in #2848
- [BugFxi]Fix Config by @YuanRisheng in #2858
- [Fix] Fix mm ep weight init. by @xiaoxiaohehe001 in #2855
- [Fix] Fix FLAGS_max_partition_size by @yangjianfengo1 in #2854
- rl update by @yuanlehome in #2861
- [Trace] add opentelemetry by @sg263 in #2852
- [Attention] remove cum_offsets from atten, and use cu_seqlens_q by @zhoutianzi666 in #2870
- fix and refine vl by @yuanlehome in #2866
- [LLM] support send batch data and aggregate data by @ltd0924 in #2860
- [LLM] Update Multinode Deployment by @ltd0924 in #2830
- [Trace] fix annotation when add opentelemetry by @sg263 in #2869
- [Feature] Add speculative decoding metrics. by @Deleter-D in #2857
- [XPU][doc] Update minimal fastdeploy required by @yulangz in #2863
- [Feature] support prompt repetition_penalty by @ming1753 in #2806
- Fix rollout_model init by @yuanlehome in #2881
- [MM_PROCESS] add _extract_labels by @LokeZhou in #2879
- [LLM] Add parameter validation and exception throwing by @ltd0924 in #2878
- Enable CI workflow for pull requests targeting release/* branches by @XieYunshen in #2887
- [Fix] remove misleading variables by @YuCosine in #2841
- [Bug Fix] fix bug of prompt penalty by @ming1753 in #2888
- [Feature][MTP] Support cacheKV transfer in per_chunk mode by @freeliuzc in #2890
- [Inference, rename] remove padding_offsets from atten use batch_id_per_token by @zhoutianzi666 in #2880
- [XPU][doc] fix typo by @yulangz in #2892
- [LLM] delete fixed slots by @ltd0924 in #2893
- [Executor] Updated the documents related to cuda graph and static graph. by @gongshaotian in #2898
- [Trace]fix opentelemetry can not work in uvicorn by @sg263 in #2906
- [BugFix]Fix sample rejection by @YuanRisheng in #2908
- [XPU] Remove padding_offsets from get_padding_offset.cu by @zhoutianzi666 in #2911
- [Executor] Fix set capture sizes bug by @gongshaotian in #2902
- Enable ci release by @XieYunshen in #2896
- remove cum_offsets from get_block_shape_and_split_kv_block by @zhoutianzi666 in #2913
- 【Feature】support vl model name_mapping and ori_vocab_size by @gzy19990617 in #2900
- [Feature] Support include_stop_str_in_output in chat/completion by @Jiang-Jia-Jun in #2910
- [Feature] Support 45tVL EP FP8 Infer. by @xiaoxiaohehe001 in #2909
- [Bug Fix] fix ep config bug by @ming1753 in #2920
- Update CI cases by @ZhangYulongg in #2916
- Polish code with new pre-commit rules by @zeroRains in #2923
- remove cum_offsets from ForwardMeta by @zhoutianzi666 in #2925
- [Iluvatar GPU] Add CI scripts by @liddk in #2876
- [LLM] delete unused code by @ltd0924 in #2931
- Rename test dir by @YuanRisheng in #2934
- [Feature]support trainer_degree in name_mapping by @gzy19990617 in #2935
- [Feature] support min_p_sampling by @lizexu123 in #2872
- use dist.all_reduce(min) to sync num_blocks_local by @yuanlehome in #2933
- [Executor] Avoid OOM when start the service while Enable Chunked Prefill + CudaGraph by @littledgg in #2936
- [Feature] Add return_token_ids, prompt_token_ids, and delete training, raw_request in request body by @liyonghua0910 in #2940
- remove some code in ep.py by @zhoutianzi666 in #2947
- custom all reduce support cuda graph by @zhink in #2938
- [Polish] Return error message of raw_request by @Jiang-Jia-Jun in #2946
- [BugFix] Rename attention params of deepseekv3 by @K11OntheBoat in #2939
- [Fix]Fix vl when import fastdeploy by @gzy19990617 in #2944
- 【DCU】update top_p_sampling by @lifulll in #2901
- [Fix] non-streaming api now returns full output ids if return_token_ids is enabled by @liyonghua0910 in #2951
- [Feature] DeepseekV3 use pd_build_static_op by @K11OntheBoat in #2948
- [SOT] Mark dynamic dims by type annotations by @SigureMo in #2771
- [Feature] Support using prefix-caching + cudagraph for inference by @zeroRains in #2924
- [Fix]fix empty prompt_token_ids,update the parser's triggering condit… by @luukunn in #2891
- [Feature] Support multi-step MTP. by @Deleter-D in #2952
- [Feature] Marlin MoE backend supports DeepseekV3 by @K11OntheBoat in #2962
- [Fix] Fix code and register cpp operators by @Deleter-D in #2965
- [Bugfix]fix rl config local rank by @gzy19990617 in #2957
- [FIX]fix rejection sampling when topp=0 using _SAMPLING_EPS by @Sunny-bot1 in #2967
- [SOT] Add sot warmup (NVIDIA GPU Only) by @DrRyanHuang in #2929
- [Fearure] support chunk_prefill in fa3 by @lizhenyun01 in #2975
- 【Infer】Improve the performance block_wise_fp8 of triton_moe_backend by @ckl117 in #2942
- [Code Simplification]delete max-len by @lizexu123 in #2959
- [CI] add codestyle_check action by @EmmonsCurse in #2972
- [Fix][MTP] fix mtp bug in pd-split mode by @freeliuzc in #2970
- [BugFix] Add prefill restrictions for chunked_prefill+VL by @zeroRains in #2983
- Fix performance degradation bug of custom_all_reduce by @zhink in #2981
- [Bug Fix] Fix hidden_size in FA by @ckl117 in #2987
- polish code for prefill restrictions by @zeroRains in #2991
- [Feature] Support block scheduler v1 for FD by @rainyfly in #2928
- eblp reload export by @bukejiyu in #2978
- [Code Simplification] fix init_distributed_environment() by @lizexu123 in #2982
- [Feature] support c4 attn && fix cache by @lizhenyun01 in #2996
- [benchmark] add quantization for benchmark yaml by @xiegegege in #2995
- [BugFix] fix mm ep empty run. by @xiaoxiaohehe001 in #2999
- add ci reuse action by @XieYunshen in #2968
- [Feature] multi-source download by @Yzc216 in #2986
- [LLM] update function name by @ltd0924 in #2985
- [BugFix] fix multinode deployment by @ltd0924 in #2977
- Update benchmark tools by @ZhangYulongg in #3004
- update flake8 version to support pre-commit in python3.12 by @zeroRains in #3000
- [Feature] multi source download by @Yzc216 in #3005
- [GCU] Update to develop by @EnflameGCU in #2988
- [Model] Provide clearer error for missing KV cache quantization scales by @littledgg in #3007
- [Feature] Support_eplb by @xiaoxiaohehe001 in #2997
- [GCU] Add CI by @EnflameGCU in #3006
- [GCU] Update post_process by @EnflameGCU in #3012
- [CI] fix codestyle_check by @EmmonsCurse in #3015
- [feature] Support FA2 by @ckl117 in #3009
- [Feat] support mixed ep by @Wanglongzhi2001 in #2969
- [feat] add disable_chat_template in chat api as a substitute for previous raw_request by @liyonghua0910 in #3020
- modified dockerfile by @XieYunshen in #3026
- Add unit test run and coverage report generation by @XieYunshen in #3011
- Optimize the performance of moe_expert_ffn_wint2 by @Xreki in #2990
- Unify server-side and model-side Config (Part1) by @YuanRisheng in #3018
- [Bug fix] Fix arguement error in PD + EP by @Wanglongzhi2001 in #3030
- [Perf] Remove unnecessary operations in non-cuda_graph by @begin2023 in #3010
- 【Bug Fix】MTP rejection_topp add topk input by @ckl117 in #3031
- [BugFix] fix c4 prompt_cache by @lizhenyun01 in #3033
- fix(ci): correct diff coverage data download URL by @XieYunshen in #3036
- [Test] Add error information by @iosmers in #3040
- Unify server-side and model-side Config (Part2) by @YuanRisheng in #3035
- 【Inference Optimize】Update wint2 weight n-dim reorder by @chang-wenbin in #3042
- [Feature] DeepseekV3 supports cuda graph by @K11OntheBoat in #3041
- add logprob ci test by @XieYunshen in #3022
- [fix] w4a8 model loading and hadamard config by @rsmallblue in #3013
- [BugFix] remove hadamard's sync by @lizhenyun01 in #3048
- optimize w4a8 decoding by @rsmallblue in #3050
- [XPU] Support kvblock centralized management by @iosmers in #3017
- Fix Speculative Config bug by @YuanRisheng in #3049
- [stop sequence] support stop sequence by @zoooo0820 in #3025
- [Bug fix] Fix ep when paddle version mismatch by @Wanglongzhi2001 in #3056
- Unify server-side and model-side Config (Part3) by @YuanRisheng in #3047
- [Bug fix] Fix arguement error in mixed_ep when pd by @Wanglongzhi2001 in #3060
- support model loading for w4a8 offline quant by @rsmallblue in #3064
- [Feature] Support repetition early stop by @zeroRains in #3024
- [SOT] Extend SOT warmup support to new hardware by @DrRyanHuang in #3032
- Fix the error when max_tokens=1 is passed by @AuferGachet in #3068
- [Docs]add sampling docs by @Sunny-bot1 in #2973
- [Feature]support bad_words by @Sunny-bot1 in #3055
- update doc: load_balance.md by @lddfym in #3008
- Unify server-side and model-side Config(Part-4) by @YuanRisheng in #3070
- [Doc] add repetition early stopping doc by @zeroRains in #3078
- [Feature] multi source download by @Yzc216 in #3072
- support W4A8 EPLB by @rsmallblue in #3075
- Add uinttest for moe_ffn_wint2. by @Xreki in #3037
- delete unused unittest by @YuanRisheng in #3065
- [BugFix] vl encoder tokens dtype problem by @ltd0924 in #3069
- [Fix] Fix version function by @Jiang-Jia-Jun in #3076
- [Feature] Multimodal Scheduler V1 by @ming1753 in #3019
- w4a8 offline by @bukejiyu in #3074
- adapter qwen3 moe attr for init by @zhink in #3066
- Add ci for custom op approve by @YuanRisheng in #3079
- Revert "Add uinttest for moe_ffn_wint2." by @chang-wenbin in #3085
- [feat]support qwen3 using new loader by @bukejiyu in #3057
- [feat] extra parameters are all passed directly via http payload now, or in extra_body if using openai client by @liyonghua0910 in #3058
- fix the memory leak when modify qp to rts failed by @huzhida in #3051
- logprob debug use by @XieYunshen in #3095
- [doc] add stop_seqs doc by @zoooo0820 in #3090
- [Feature] support ep in mixed mode by @ltd0924 in #3001
- [BugFix]Fix ep size by @YuanRisheng in #3092
- [Feature] Support include_stop_str_in_output in completion api by @Jiang-Jia-Jun in #3096
- [BUG FIX] Fix bug when preempted request rescheduled by @rainyfly in #3080
- [Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel by @gongshaotian in #2989
- add approve ci by @XieYunshen in #3093
- [Doc] Release fastdeploy-xpu 2.0.3 by @iosmers in #3105
- [doc] best practice for eb45 text models by @zoooo0820 in #3002
- fix ci by @XieYunshen in #3106
- [Bug Fix] Fix bug for offline inference in scheduler v1 by @rainyfly in #3117
- [Feature] block scheduler v1 support prefix caching by @kevincheng2 in #3061
- [Doc] add chat_template_kwagrs and update params docs by @LiqinruiG in #3103
- fix is_permuted by @rsmallblue in #3098
- Fix test_EB_Lite_serving.py by @ZhangYulongg in #3119
- [BUG] Fix bug for pd in fd by @rainyfly in #3034
- [Feature] General support for logprobs by @sunlei1024 in #2974
- [BugFix] fix request_output sampling_params in PD by @ckl117 in #3154
- [Executor]Fix get_block_shape_and_split_kv_block Kernel typo by @gongshaotian in #3153
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3173
- [cherry-pick]fix load_pre_sharded_checkpoint (#3152) by @bukejiyu in #3169
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3131
- [XPU] Update XPU dockerflie by @plusNew001 in #3147
- Apply CI fix from Develop by @XieYunshen in #3151
- [Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3178
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit test by @sunlei1024 in #3201
- [BugFix] support real batch_size by @lizexu123 in #3217
- [FIX 2.1]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3199
- [Trace]merge develop trace FD_START by @sg263 in #3253
- [Cherry-pick] fix stop seq by @zoooo0820 in #3263
- [BugFix] fix too many open files problem by @ltd0924 in #3275
- [XPU] Revert PR 3217 by @iosmers in #3286
- [BugFix] fix ep by @lizexu123 in #3290
- fix scheduler bug in release2.1 by @rainyfly in #3295
- Revert "[BugFix] fix ep " by @Jiang-Jia-Jun in #3317
- [Bug Fix] fix uvicorn multi worker error by @kevincheng2 in #3309
- fix ci pypi index error by @XieYunshen in #3327
- [Docs]fix sampling docs 2.1 by @Sunny-bot1 in #3333
- [Bug Fix] fix vl V1 schedule bug by @ming1753 in #3284
- Fix block num in schduelr v1 for release 2.1 by @rainyfly in #3315
- [Bug fix] fix bug for scheduler v0 by @rainyfly in #3306
- Remove useless code release/2.1 by @Jiang-Jia-Jun in #3338
- [BugFix]fix mapping by @gzy19990617 in #3322
- Release/2.1 by @memoryCoderC in #3361
- Completion add raw_prediction/text_after_process by @memoryCoderC in #3362
- Use latest PaddlePaddle package (#3347) by @XieYunshen in #3352
- Pre ce modified (#3335) by @XieYunshen in #3360
- [Bug Fix] Fix V1 video bug by @ming1753 in #3387
- [Cherry-pick] fix stopseq error info by @zoooo0820 in #3342
- [BugFix] Fix default log level of paddleformers by @Jiang-Jia-Jun in #3377
- feat(log):add_request_and_response_log by @xiaolei373 in #3392
- Optimize CI execution workflow. (#3371) by @XieYunshen in #3384
- [BugFix] fix control signal release failed by @ltd0924 in #3374
- [XPU] Fixed the issue of performance degradation caused by enabling ENABLE_V1_KVCACHE_SCHEDULER by @iosmers in #3393
- [BugFix] fix ErnieProcessor not set raw_prediction by @memoryCoderC in #3401
- [Doc]Release fastdeploy-xpu 2.1.0 by @iosmers in #3407
- [Doc]Release fastdeploy-xpu 2.1.0 by @iosmers in #3408
## New Contributors
- @ZhangHandi made their first contribution in #2673
- @LiqinruiG made their first contribution in #2696
- @quanxiang-liu made their first contribution in #2701
- @wtmlon made their first contribution in #2715
- @xiegetest made their first contribution in #2732
- @celsowm made their first contribution in #2670
- @lifulll made their first contribution in #2756
- @0x3878f made their first contribution in #2779
- @K11OntheBoat made their first contribution in #2798
- @xiegegege made their first contribution in #2824
- @ophilia-lee made their first contribution in #2851
- @YuCosine made their first contribution in #2841
- @SigureMo made their first contribution in #2771
- @Xreki made their first contribution in #2990
- @begin2023 made their first contribution in #3010
- @AuferGachet made their first contribution in #3068
- @huzhida made their first contribution in #3051
Full Changelog: v2.0.0...v2.1.0