vLLM v0.21.0 Release Notes Summary

Version 0.21.0 delivers performance improvements for next-generation multimodal models, focusing on the Gemma 4 and Qwen 3.6 ecosystems.

🚀 Google Gemma 4 Optimizations

Gemma 4 support is now native, moving beyond the generic GemmaForCausalLM implementation:

Multi-Token Prediction (MTP) Support: Native integration for Gemma 4's MTP heads, allowing up to a 3x throughput increase when using the official gemma-4-mtp draft models.
Video-to-Token Pipeline: Integrated support for Gemma 4’s native video processing. You can now pass video_url or video_path directly to LLM.generate().
FlashAttention-3 Integration: Gemma 4 models now leverage FP8 FlashAttention-3 by default on H100/B200 GPUs, reducing memory overhead by 40%.
Requirement: Gemma 4 weight loading requires transformers>=5.5.0.
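
The transformers requirement above can be checked before attempting to load Gemma 4 weights. The following is a minimal sketch using only the standard library; the helper names (release_tuple, meets_minimum, installed_transformers_ok) are hypothetical and not part of vLLM or transformers, and pre-release suffixes such as ".dev0" are ignored for simplicity.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum version taken from the requirement stated in these notes.
MIN_TRANSFORMERS = "5.5.0"


def release_tuple(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Simplification (assumption): suffixes like 'rc1' or '.dev0' are dropped,
    so '5.5.0.dev0' is treated the same as '5.5.0'.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that '5.5' compares equal to '5.5.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum version."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_transformers_ok() -> Optional[bool]:
    """Check the local environment; returns None if transformers is absent."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_transformers_ok() at startup gives a clear signal (True, False, or None when the package is missing) instead of a failure partway through weight loading.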

🚀 Alibaba Qwen 3.6 Enhancements

Qwen 3.6 (including MoE and Coder variants) receives several specialized kernels:

Gated DeltaNet Attention: Optimized kernels for Qwen 3.6’s unique attention mechanism, fixing the high-latency issues seen in the v0.20.x experimental branch.
Reasoning Parser (--reasoning-parser qwen3): A new server-side flag that automatically structures Qwen 3.6's "Thought" tokens into a separate metadata field in the OpenAI-compatible API response.
MoE Load Balancing: Improved tensor_parallel distribution specifically for Qwen3.6-35B-A3B, ensuring better utilization across multiple GPUs.
Tool-Calling Accuracy: Fixed a bug where Qwen 3.6 Coder models would occasionally hallucinate JSON delimiters in zero-shot prompts.
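
To illustrate what a reasoning parser such as --reasoning-parser qwen3 does conceptually, the sketch below separates a "thought" block from the final answer in raw model output. It is not vLLM's implementation. It assumes the thought tokens are wrapped in <think>...</think> tags, as in earlier Qwen3 releases, and the output field names "reasoning" and "content" are hypothetical stand-ins for whatever the server actually returns.

```python
import re
from typing import Optional, TypedDict


class ParsedMessage(TypedDict):
    reasoning: Optional[str]  # hypothetical field name for the thought text
    content: str              # the user-facing answer


# Assumption: thoughts are delimited by <think>...</think>.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> ParsedMessage:
    """Split raw model output into a reasoning part and an answer part."""
    match = _THINK_BLOCK.search(raw)
    if match is None:
        # No thought block present: everything is answer content.
        return {"reasoning": None, "content": raw.strip()}
    reasoning = match.group(1).strip()
    # Remove the first thought block and keep the surrounding text.
    content = (raw[: match.start()] + raw[match.end():]).strip()
    return {"reasoning": reasoning, "content": content}
```

Doing this split server-side means clients no longer need to strip thought tokens themselves before displaying or post-processing the answer.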

🛠️ General Core Changes

Zero-Bubble Scheduling: Async scheduling now overlaps the "Model Runner" and "GPU Executor" phases, eliminating gaps between requests.
FP8 W8A16 Support: Experimental support for FP8 (8-bit) weights with 16-bit activations, significantly improving accuracy over pure FP8 (W8A8) for reasoning-heavy tasks.
Breaking Change: Support for Python 3.9 is officially dropped. Python 3.11+ is now the recommended runtime.
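
A deployment script can guard against the dropped Python version before installing or launching. The sketch below is illustrative only: the notes do not state the new minimum version, so 3.10 is assumed here as the lowest supported release, and classify_python is a hypothetical helper.

```python
import sys
from typing import Tuple

# Assumption: with 3.9 dropped, 3.10 is treated as the lowest supported version.
MIN_SUPPORTED = (3, 10)
# Stated in the notes as the recommended runtime.
RECOMMENDED = (3, 11)


def classify_python(version: Tuple[int, int]) -> str:
    """Classify a (major, minor) Python version against the release notes."""
    if version < MIN_SUPPORTED:
        return "unsupported"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"


if __name__ == "__main__":
    # Report the status of the interpreter running this script.
    print(classify_python(sys.version_info[:2]))
```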
