vLLM v0.21.0 Release: The "Omni-Architecture" Update
Version 0.21.0 focuses on performance for next-generation multimodal models, in particular the Gemma 4 and Qwen 3.6 ecosystems.
🚀 Google Gemma 4 Optimizations
Gemma 4 support is now native, moving beyond the generic GemmaForCausalLM implementation:
- Multi-Token Prediction (MTP) Support: Native integration for Gemma 4's MTP heads, allowing up to a 3x throughput increase when using the official gemma-4-mtp draft models.
- Video-to-Token Pipeline: Integrated support for Gemma 4’s native video processing. You can now pass video_url or video_path directly to the LLM.generate() method.
- FlashAttention-3 Integration: Gemma 4 models now leverage FP8 FlashAttention-3 by default on H100/B200 GPUs, reducing memory overhead by 40%.
- Requirement: Gemma 4 weight loading requires transformers>=5.5.0; the dependency has been upgraded accordingly.
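
Because Gemma 4 weight loading depends on the transformers version, it can help to check the installed version before starting a server. The following is a minimal sketch, not part of vLLM: the helper names parse_release, meets_minimum, and installed_transformers_ok are hypothetical, and only the 5.5.0 threshold comes from the requirement above. It compares numeric release segments and ignores pre-release suffixes such as .dev0.

```python
import re
from importlib import metadata

# Minimum transformers release taken from the requirement above.
MIN_TRANSFORMERS = (5, 5, 0)


def parse_release(version: str) -> tuple:
    """Return the leading numeric release padded to three parts.

    Pre-release or local suffixes are ignored, so '5.5.0.dev0' -> (5, 5, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions such as '5.6' so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def meets_minimum(installed: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """Compare numerically, so that '5.10.0' correctly sorts above '5.5.0'."""
    return parse_release(installed) >= minimum


def installed_transformers_ok() -> bool:
    """Return False when transformers is missing or older than the minimum."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers OK for Gemma 4:", installed_transformers_ok())
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "5.10.0" as older than "5.5.0".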
🚀 Alibaba Qwen 3.6 Enhancements
Qwen 3.6 (including MoE and Coder variants) receives several targeted improvements:
- Gated DeltaNet Attention: Optimized kernels for Qwen 3.6’s unique attention mechanism, fixing the high-latency issues seen in the v0.20.x experimental branch.
- Reasoning Parser (--reasoning-parser qwen3): A new server-side flag that automatically separates Qwen 3.6's "Thought" tokens into their own metadata field in the OpenAI-compatible API response.
- MoE Load Balancing: Improved tensor_parallel distribution specifically for Qwen3.6-35B-A3B, ensuring better utilization across multiple GPUs.
- Tool-Calling Accuracy: Fixed a bug where Qwen 3.6 Coder models would occasionally hallucinate JSON delimiters in zero-shot prompts.
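
To illustrate what a reasoning parser does conceptually, the sketch below splits raw model text into a reasoning part and a final answer. It is not vLLM's parser. It assumes the thought tokens are wrapped in `<think>` and `</think>` tags, which these notes do not state, and the names ParsedOutput and split_reasoning are hypothetical.

```python
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    reasoning: Optional[str]  # text found before the closing tag, if any
    content: str              # the user-facing answer


def split_reasoning(
    text: str,
    open_tag: str = "<think>",    # assumed delimiter, not confirmed by the notes
    close_tag: str = "</think>",  # assumed delimiter, not confirmed by the notes
) -> ParsedOutput:
    """Separate reasoning text from the final answer.

    If no closing tag is present, the whole text is treated as the answer.
    """
    before, sep, after = text.partition(close_tag)
    if not sep:
        return ParsedOutput(None, text.strip())
    # Drop the first opening tag if present, then trim whitespace.
    reasoning = before.replace(open_tag, "", 1).strip()
    return ParsedOutput(reasoning or None, after.strip())


if __name__ == "__main__":
    raw = "<think>2 + 2 equals 4.</think>\nThe answer is 4."
    parsed = split_reasoning(raw)
    print("reasoning:", parsed.reasoning)
    print("content:", parsed.content)
```

A server-side parser saves every client from repeating this kind of string handling, and it keeps the answer field free of reasoning text.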
🛠️ General Core Changes
- Zero-Bubble Scheduling: Async scheduling now overlaps the "Model Runner" and "GPU Executor" phases, eliminating gaps between requests.
- FP8 W8A16 Support: Experimental support for 8-bit weights with 16-bit activations, significantly improving accuracy over pure FP8 for reasoning-heavy tasks.
- Breaking Change: Support for Python 3.9 is officially dropped. Python 3.11+ is now the recommended runtime.
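
The overlap idea behind Zero-Bubble Scheduling can be sketched with asyncio: while one batch executes, preparation of the next batch is already in flight. This is a toy illustration under stated assumptions, not vLLM's scheduler. The functions prepare, execute, and run_overlapped are hypothetical, and asyncio.sleep stands in for real CPU-side preparation and GPU execution.

```python
import asyncio


async def prepare(i: int, log: list) -> int:
    """Stand-in for CPU-side batch preparation (hypothetical)."""
    log.append(f"prepare-start {i}")
    await asyncio.sleep(0.01)
    log.append(f"prepare-end {i}")
    return i


async def execute(batch: int, log: list) -> int:
    """Stand-in for GPU execution of a prepared batch (hypothetical)."""
    log.append(f"execute-start {batch}")
    await asyncio.sleep(0.01)
    log.append(f"execute-end {batch}")
    return batch * 2


async def run_overlapped(n: int, log: list) -> list:
    """Run n batches, preparing batch i+1 while batch i executes."""
    results = []
    if n <= 0:
        return results
    next_prep = asyncio.create_task(prepare(0, log))
    for i in range(n):
        batch = await next_prep
        if i + 1 < n:
            # Schedule the next preparation before executing the current batch.
            next_prep = asyncio.create_task(prepare(i + 1, log))
        results.append(await execute(batch, log))
    return results


if __name__ == "__main__":
    events = []
    print(asyncio.run(run_overlapped(3, events)))
    print(events)
```

In the event log, preparation of batch 1 starts before execution of batch 0 ends. That ordering is the overlap which removes idle gaps between consecutive steps.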
For the full list of 350+ PRs, visit the official vLLM GitHub repository.