enable auto-round quantization model #6226
Conversation
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
…com/WeiweiZhang1/sglang into enable_autoround_quantization_model
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Please kindly take a look when you are free.
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Thank you for your thorough review. I've updated the code based on your comments. Any additional feedback or suggestions? @AniZpZ
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
No more concerns from me.
@AniZpZ Hi, would it be possible for you to take a look at the unit test failures and help identify whether any of them are related to this PR? Thank you in advance!
I think it is OK. Please fix the lint issues introduced by resolving the conflicts.
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
Great! I've addressed the lint issue.
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
…able_autoround_quantization_model
Signed-off-by: Zhang, Weiwei1 <weiwei1.zhang@intel.com>
This PR adds support for models quantized by AutoRound (GitHub, paper).
AutoRound delivers significantly higher accuracy at extremely low bit-widths (e.g., 2-bit) and offers broader compatibility across models (LLMs and VLMs), quantization formats, and configurations. You can check out our github/paper or this blog post.
AutoRound has been integrated into vLLM, pytorch/ao, and Hugging Face Transformers. Several Hugging Face Spaces offer models quantized with AutoRound, including OPEA, Kaitchup, and fbaldassarri.
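For reference, here is a minimal sketch of how an AutoRound-exported checkpoint could be served through sglang's offline engine once this PR is in. The repo id is a placeholder (any AutoRound INT4 checkpoint, e.g. from the OPEA space, would do), and the quantization settings are assumed to be auto-detected from the checkpoint's quantization_config rather than passed explicitly:

```python
# Minimal sketch, not taken from this PR: serve an AutoRound-quantized
# checkpoint with sglang's offline engine. The repo id below is a
# placeholder; the quantization method is assumed to be read from the
# checkpoint's quantization_config at load time.
import sglang as sgl

llm = sgl.Engine(model_path="OPEA/Qwen2.5-7B-Instruct-int4-autoround")  # hypothetical repo id

outputs = llm.generate(
    ["What does AutoRound quantization do?"],
    {"temperature": 0.0, "max_new_tokens": 64},
)
for out in outputs:
    print(out["text"])

llm.shutdown()
```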
Known issues
Mixed bits support is limited
Mixed-bit quantization is currently limited. Since vLLM fuses layers (e.g., QKV), applying different bit-widths to components within the same fused layer can lead to incompatibility issues.
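To make the failure mode concrete, here is a hedged sketch of the kind of per-layer recipe that hits this limitation. The layer names follow the usual Llama/Qwen-style module naming and the layer_config parameter name follows recent auto-round releases; neither is taken from this PR, and the model is just a small placeholder:

```python
# Sketch only: a mixed-bit recipe that fused layers cannot represent.
# Once the serving engine fuses q_proj/k_proj/v_proj into a single QKV
# projection, a layer quantized at 8 bits cannot share that fused kernel
# with siblings quantized at 4 bits.
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_config = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},  # differs from its fused siblings below
    "model.layers.0.self_attn.k_proj": {"bits": 4},
    "model.layers.0.self_attn.v_proj": {"bits": 4},
}

autoround = AutoRound(
    model, tokenizer, bits=4, group_size=128, sym=True, layer_config=layer_config
)
autoround.quantize()
autoround.save_quantized("./mixed-bit-ckpt", format="auto_round")
```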
Quantized MoE model support is limited
Qwen3-30B-A3B: accuracy is close to zero. The GPTQ format fails with 'Capture CUDA graph failed: Apply router weight on input is not supported for fused Marlin MoE method'; for the AWQ format, symmetric quantization reports 'KeyError: 'model.layers.13.mlp.experts.w2_qzeros'', and asymmetric quantization also produces accuracy close to zero.
deepseek-moe-16b-base: 'ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.' The same issue exists for both the AWQ and GPTQ formats.
Quantized VLM support is limited
Qwen2.5-VL-7B: the auto_round:auto_gptq format has accuracy close to zero, and the GPTQ model fails with 'The output size is not aligned with the quantized weight shape'. The auto_round:auto_awq and AWQ formats work fine.